TL;DR#
Large Language Models (LLMs) are rapidly evolving, but most open-source models use dense architectures, limiting their efficiency and scalability. Mixture-of-Experts (MoE) models offer an alternative by distributing computation across specialized sub-models, yet open-source MoE models have so far lacked scale and robust training recipes. This research addresses these issues by introducing Hunyuan-Large.
Hunyuan-Large is a massive open-source MoE model, with 389B total and 52B activated parameters, that outperforms other open-source LLMs across a wide range of benchmarks. Its success is attributed to several key innovations, including large-scale synthetic training data, a mixed expert routing strategy combining shared and specialized experts, and efficiency techniques such as KV cache compression and expert-specific learning rates. The paper also investigates the scaling laws of MoE models, providing valuable guidance for future model development and optimization.
Key Takeaways#
Why does it matter?#
This paper is crucial because it presents Hunyuan-Large, a significant advancement in open-source large language models (LLMs). Its massive scale and innovative MoE architecture address limitations of existing models, opening avenues for research on efficient scaling and improved performance. The release of the model’s code and checkpoints directly benefits the research community, accelerating progress in LLM development and application. This work also offers insights into the scaling laws of MoE models, guiding future development.
Visual Insights#
🔼 This figure illustrates the four-step data synthesis process used in the pre-training of Hunyuan-Large. First, instructions are generated using various sources like web pages and books. Second, these instructions are evolved by refining them, expanding low-resource domains, and increasing the difficulty level. Third, responses to these evolved instructions are generated by specialized models. Finally, the generated instruction-response pairs are filtered to ensure high quality and consistency, removing low-quality or inconsistent data. This process is crucial for creating high-quality and diverse training data for the model.
Figure 1: The four-step process of data synthesis in Hunyuan-Large’s pre-training: (1) Instruction generation, (2) Instruction evolution, (3) Response generation, and (4) Response filtering.
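The following is a minimal sketch of how such a four-step synthesis loop could be wired together. The helper callables (`generate_instruction`, `evolve_instruction`, `generate_response`, `passes_filter`) are hypothetical stand-ins for the seeded generation, evolution, specialized-model response, and quality-filtering stages described in the caption, not Tencent's released implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class SyntheticPair:
    instruction: str
    response: str

def synthesize(
    seeds: Iterable[str],
    generate_instruction: Callable[[str], str],  # step 1: seed text -> instruction
    evolve_instruction: Callable[[str], str],    # step 2: refine, diversify, raise difficulty
    generate_response: Callable[[str], str],     # step 3: specialized model answers
    passes_filter: Callable[[str, str], bool],   # step 4: quality/consistency check
) -> List[SyntheticPair]:
    """Four-step instruction/response synthesis loop (illustrative sketch)."""
    pairs: List[SyntheticPair] = []
    for seed in seeds:  # seeds drawn from sources such as web pages and books
        instruction = generate_instruction(seed)
        instruction = evolve_instruction(instruction)
        response = generate_response(instruction)
        if passes_filter(instruction, response):  # drop low-quality or inconsistent pairs
            pairs.append(SyntheticPair(instruction, response))
    return pairs
```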
Configuration | Hunyuan-Large |
---|---|
# Layers | 64 |
# Attention Heads | 80 |
# Key/Value Heads | 8 |
# Shared Experts | 1 |
# Specialized Experts | 16 |
# Activated Specialized Experts | 1 |
# Trained Tokens | 7T |
Activation Function | SwiGLU |
Vocabulary Size | 128K |
Hidden Size | 6,400 |
🔼 Table 1 presents a detailed breakdown of the Hunyuan-Large model’s architecture and key hyperparameters. It highlights the model’s impressive scale, with 389 billion total parameters and 52 billion activated parameters. The table clarifies the model’s structure, specifying the number of layers, attention heads, key/value heads, and the unique configuration of experts (1 shared and 1 specialized expert activated per token). This level of detail is crucial for understanding the model’s complexity and resource requirements.
Table 1: Overview of the architecture and key hyper-parameters of Hunyuan-Large. This model has 389B total parameters and 52B activated parameters. There are 1 shared expert and 1 specialized expert activated for each token.
In-depth insights#
MoE Model Scaling#
The research explores Mixture-of-Experts (MoE) model scaling, focusing on the relationship between model size, training data, and performance. The authors investigate scaling laws, revealing that optimal performance is achieved with a specific balance between activated parameters and training data. The study highlights the importance of high-quality synthetic data, used here at a scale exceeding that of previous work, for effective MoE training. Furthermore, they introduce and analyze the efficiency gains from strategies such as mixed expert routing, KV cache compression, and expert-specific learning rates, demonstrating practical techniques for optimizing MoE model training and deployment. These findings offer valuable insights for future MoE model development, guiding researchers toward more efficient and powerful large language models.
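The expert-specific learning rate mentioned above can be illustrated with a small sketch. The square-root batch-size scaling heuristic below is an assumption used for illustration; only the 1-of-16 specialized-expert routing ratio comes from the model configuration in Table 1.

```python
import math

def expert_learning_rate(base_lr: float, top_k: int, n_specialized: int) -> float:
    """Scale the learning rate for specialized experts (illustrative heuristic).

    Each specialized expert only processes roughly top_k / n_specialized of the
    tokens in a batch, so its effective batch size is smaller than that of the
    densely updated parameters. A common heuristic (assumed here) is to scale
    the learning rate with the square root of the effective batch size.
    """
    effective_fraction = top_k / n_specialized  # e.g. 1 / 16 for Hunyuan-Large
    return base_lr * math.sqrt(effective_fraction)

# With 1 of 16 specialized experts activated per token:
print(expert_learning_rate(base_lr=1.0, top_k=1, n_specialized=16))  # 0.25
```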
Synthetic Data Power#
The research paper does not have a heading titled ‘Synthetic Data Power’, but it extensively discusses the crucial role of high-quality synthetic data in training the Hunyuan-Large model. A significant portion of the training data (1.5T of the 7T tokens) is synthetic, produced through the four-step process of instruction generation, instruction evolution, response generation, and response filtering. This approach improves data quality and diversity, enabling the model to learn richer representations and generalize better to unseen data. The use of synthetic data is highlighted as a key innovation differentiating Hunyuan-Large from previous models, particularly in its scale and its focus on diverse, knowledge-intensive fields such as mathematics and coding. The effectiveness of this strategy is supported by the model’s superior performance on various benchmarks.
KV Cache Efficiency#
To address the memory constraints and computational costs of key-value (KV) caching in large language models (LLMs), especially those with Mixture-of-Experts (MoE) architectures, the authors implemented two compression strategies: Grouped-Query Attention (GQA) and Cross-Layer Attention (CLA). GQA groups the KV heads, reducing the overall cache size, while CLA shares the KV cache across adjacent layers, further enhancing efficiency. Combined, these techniques reduce total KV cache memory by roughly 95% compared to standard multi-head attention (the Table 2 formulas below make this figure easy to verify). This optimization substantially improves inference efficiency without a meaningful impact on model quality, demonstrating an effective recipe for scalable LLM deployment.
Post-Training Methods#
The research paper’s “Post-Training Methods” section details techniques to enhance the pre-trained Hunyuan-Large model. Supervised Fine-Tuning (SFT) refines the model using high-quality instruction data encompassing diverse tasks like mathematical problem-solving and code generation. This process focuses on data collection, balancing instruction types, and quality control through rule-based and model-based filtering, alongside human review. Reinforcement Learning from Human Feedback (RLHF) further improves the model using a single-stage training strategy combining offline and online methods. This involves utilizing a pre-compiled preference dataset and a reward model to select and optimize responses, preventing issues like reward hacking. The combination of SFT and RLHF is designed to align the model better with human preferences while enhancing its performance and addressing practical application needs.
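As a rough illustration of preference optimization over offline and online pairs, the snippet below sketches a DPO-style objective; treating the RLHF stage as exactly this loss, and the way offline and online pairs are mixed into one batch, are assumptions rather than details confirmed by this summary.

```python
import torch
import torch.nn.functional as F

def preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x) from the frozen SFT model
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO-style preference loss (sketch).

    Offline pairs come from a pre-compiled preference dataset; online pairs are
    responses sampled from the current policy and ranked by a reward model.
    Both kinds of pairs can be mixed in the same batch for single-stage training.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```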
Long-Context Limits#
The provided text does not contain a heading specifically titled ‘Long-Context Limits’. However, sections discussing the model’s ability to handle long sequences of text are present. Hunyuan-Large is demonstrated to successfully process sequences up to 256K tokens, showcasing significant advancements in long-context capabilities. This is achieved through a combination of strategies including the use of Rotary Position Embeddings (RoPE) and scaling of the RoPE base frequency, which enhances the model’s ability to manage long-range dependencies within the text. The paper also reports experimental results on benchmarks designed to assess long-context understanding, such as RULER and LV-Eval. While the exact limits aren’t explicitly defined as a ‘Long-Context Limit’, the results across various benchmarks show that performance does not significantly degrade even with very long input sequences, suggesting that the model effectively handles long-range dependencies. The introduction of a custom dataset, PenguinScrolls, further tests the model’s limits within realistic long-context scenarios. Overall, the paper strongly suggests that Hunyuan-Large pushes the boundaries of current long-context processing capabilities of large language models.
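The long-context recipe relies on RoPE with an enlarged base frequency. The sketch below shows how the base enters the rotary frequencies; the scaled base value of 1e9 is an assumption chosen only to illustrate the effect, while the per-head dimension of 80 follows from Table 1 (6,400 hidden size / 80 heads).

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float) -> np.ndarray:
    """Per-pair rotary angular frequencies: theta_i = base ** (-2i / d)."""
    i = np.arange(0, head_dim, 2)
    return base ** (-i / head_dim)

def rope_angles(positions: np.ndarray, head_dim: int, base: float) -> np.ndarray:
    """Rotation angles applied to query/key pairs at each position."""
    return np.outer(positions, rope_frequencies(head_dim, base))

positions = np.array([0, 1_000, 100_000, 256_000])
short_base = rope_angles(positions, head_dim=80, base=10_000.0)  # conventional base
long_base = rope_angles(positions, head_dim=80, base=1e9)        # enlarged base (assumed value)
# With the larger base, the low-frequency dimensions rotate far less per token,
# so positions hundreds of thousands of tokens apart remain distinguishable.
```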
More visual insights#
More on figures
🔼 In traditional top-k routing, tokens are assigned to the top k experts based on their scores. If an expert exceeds its maximum capacity, the excess tokens are dropped. This can lead to information loss and inefficiency.
(a) Traditional Top-k Routing.
🔼 This figure shows the Recycle Routing strategy used in Hunyuan-Large. In traditional Top-k routing, tokens from overloaded experts are dropped. However, the Recycle Routing strategy reassigns these tokens to other experts that are not overloaded, preventing loss of information and improving training efficiency. The illustration compares the traditional approach with the new recycle routing.
(b) Recycle Routing.
🔼 This figure illustrates the difference between the traditional top-k routing strategy and the novel recycle routing strategy used in Hunyuan-Large. In the traditional approach (a), when an expert’s capacity is reached, excess tokens are dropped, potentially leading to information loss. The recycle routing strategy (b) addresses this by randomly reassigning tokens initially sent to overloaded experts to other experts that are not at capacity. This ensures no information is lost and maintains efficiency.
Figure 2: An illustration of the recycle routing strategy in Hunyuan-Large, where each expert’s maximum capacity is set to 2. Token D, which was initially allocated to the overloaded Expert 1, is reassigned to a randomly selected Expert 4. This approach helps alleviate the potential loss of valuable information. In traditional routing strategies, tokens from overloaded experts would be dropped as shown in (a). However, our strategy involves randomly reassigning these tokens to other experts, as demonstrated in (b), where Token D is routed to Expert 4.
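A minimal sketch of the recycle-routing idea in Figure 2: capacity-limited top-k assignment where overflow tokens are reassigned to a randomly chosen expert with spare capacity instead of being dropped. Function and variable names are illustrative and not taken from the released code.

```python
import random
from collections import defaultdict

def recycle_route(token_scores, k=1, capacity=2, seed=0):
    """Assign each token to its top-k experts; recycle tokens that overflow.

    token_scores: dict mapping token -> list of per-expert routing scores.
    Returns a dict mapping expert index -> list of tokens assigned to it.
    """
    rng = random.Random(seed)
    n_experts = len(next(iter(token_scores.values())))
    load = defaultdict(list)

    for token, scores in token_scores.items():
        top_experts = sorted(range(n_experts), key=lambda e: scores[e], reverse=True)[:k]
        for expert in top_experts:
            if len(load[expert]) < capacity:
                load[expert].append(token)
            else:
                # Traditional top-k routing would simply drop the token here.
                spare = [e for e in range(n_experts) if len(load[e]) < capacity]
                if spare:  # recycle to a random expert with remaining capacity
                    load[rng.choice(spare)].append(token)
    return {e: tokens for e, tokens in load.items() if tokens}

# Four tokens that all prefer Expert 0, whose capacity is 2: two tokens are recycled.
scores = {t: [1.0, 0.5, 0.4, 0.3] for t in "ABCD"}
print(recycle_route(scores, k=1, capacity=2))
```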
🔼 This figure shows the relationship between the optimal number of activated parameters in a Mixture of Experts (MoE) model and the minimum compute budget. By using quadratic polynomial fitting on data from experiments with varying numbers of activated parameters and training data, the authors derived a scaling law. This law helps guide the choice of the optimal model size based on available computational resources. The x-axis represents the minimum compute budget (FLOPs_min), and the y-axis represents the optimal number of activated parameters. The curves represent the scaling law at different training loss values, providing insights for effective and efficient model training with limited resources.
Figure 3: Using quadratic polynomial fitting, we obtain the scaling law of the optimal number of activation parameters under different minimum compute budgets.
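The quadratic fit behind Figure 3 can be reproduced in outline as follows. The data points below are hypothetical stand-ins for the paper's measurements, and fitting in log-log space is an assumption made for illustration.

```python
import numpy as np

# Hypothetical (compute budget in FLOPs, optimal activated parameters) pairs --
# placeholders for the measurements behind Figure 3, not real values.
flops = np.array([1e20, 1e21, 1e22, 1e23, 1e24])
optimal_params = np.array([2e9, 7e9, 2.5e10, 8e10, 3e11])

# Quadratic polynomial fit in log-log space.
coeffs = np.polyfit(np.log10(flops), np.log10(optimal_params), deg=2)
fit = np.poly1d(coeffs)

def optimal_activated_params(compute_flops: float) -> float:
    """Predicted optimal number of activated parameters for a given compute budget."""
    return 10 ** fit(np.log10(compute_flops))

print(f"{optimal_activated_params(3e22):.3e}")  # interpolated prediction
```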
More on tables
Attention Mechanism | KV Cache Memory |
---|---|
MHA | 4·n_h·d_h·l |
GQA | 4·n_g·d_h·l |
MQA | 4·d_h·l |
CLA | 2·n_h·d_h·l |
GQA+CLA | 2·n_g·d_h·l |
🔼 This table compares the memory usage (in bytes, using bf16 precision) of different attention mechanisms used in Transformer models. The comparison includes Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), and Cross-Layer Attention (CLA). It also shows the combined effect of GQA and CLA, which is used in the Hunyuan-Large model. The table shows how the memory usage scales with the number of attention heads (n_h), the dimension per head (d_h), the number of layers (l), and the number of groups in GQA (n_g, where n_g < n_h). Cross-Layer Attention (CLA) is implemented by sharing the KV cache every 2 layers. The table helps illustrate the memory savings achieved by using GQA+CLA in Hunyuan-Large compared to traditional MHA.
Table 2: Comparisons of KV cache memory (in bytes on bf16) for different attention mechanisms. The attention mechanisms include Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), Cross-Layer Attention (CLA), and GQA+CLA (the final setting in Hunyuan-Large). n_h, d_h, l, and n_g represent the number of attention heads, the dimension per head, the number of layers, and the number of groups in GQA (n_g < n_h), respectively.
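Plugging the Hunyuan-Large configuration from Table 1 into these formulas reproduces the roughly 95% KV cache saving quoted earlier. The per-head dimension of 80 is inferred as hidden size divided by attention heads (6,400 / 80), an assumption not stated directly in the table.

```python
# KV cache bytes per token (bf16), following the Table 2 formulas.
n_h = 80          # attention heads (Table 1)
n_g = 8           # key/value heads, i.e. GQA groups (Table 1)
d_h = 6400 // 80  # dimension per head = hidden size / heads (assumed) = 80
l = 64            # layers (Table 1)

kv_bytes = {
    "MHA":     4 * n_h * d_h * l,
    "GQA":     4 * n_g * d_h * l,
    "MQA":     4 * d_h * l,
    "CLA":     2 * n_h * d_h * l,
    "GQA+CLA": 2 * n_g * d_h * l,  # Hunyuan-Large's setting
}

for name, size in kv_bytes.items():
    print(f"{name:8s} {size:>9,d} bytes per token")

saving = 1 - kv_bytes["GQA+CLA"] / kv_bytes["MHA"]
print(f"GQA+CLA vs. MHA saving: {saving:.0%}")  # -> 95%
```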
Model | LLaMA3.1-405B | LLaMA3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
---|---|---|---|---|---|
Architecture | Dense | Dense | MoE | MoE | MoE |
# Activated Params | 405B | 70B | 39B | 21B | 52B |
# Total Params | 405B | 70B | 141B | 236B | 389B |
Context Length | 128k | 128k | 64k | 128k | 256k |
English | |||||
MMLU | 85.2 | 79.3 | 77.8 | 78.5 | 88.4 |
MMLU-Pro | 61.6 | 53.8 | 49.5 | - | 60.2 |
BBH | 85.9 | 81.6 | 78.9 | 78.9 | 86.3 |
HellaSwag | - | - | 88.7 | 87.8 | 86.8 |
CommonsenseQA | 85.8 | 84.1 | 82.4 | - | 92.9 |
WinoGrande | 86.7 | 85.3 | 85.0 | 84.9 | 88.7 |
PIQA | - | - | 83.6 | 83.7 | 88.3 |
NaturalQuestions | - | - | 39.6 | 38.7 | 52.8 |
DROP | 84.8 | 79.6 | 80.4 | 80.1 | 88.9 |
ARC-C | 96.1 | 92.9 | 91.2 | 92.4 | 95.0 |
TriviaQA | - | - | 82.1 | 79.9 | 89.2 |
Chinese | |||||
CMMLU | - | - | 60.0 | 84.0 | 90.2 |
C-Eval | - | - | 59.6 | 81.7 | 91.9 |
C3 | - | - | 71.4 | 77.4 | 82.3 |
Math | |||||
GSM8K | 89.0 | 83.7 | 83.7 | 79.2 | 92.8 |
MATH | 53.8 | 41.4 | 42.5 | 43.6 | 69.8 |
CMATH | - | - | 72.3 | 78.7 | 91.3 |
Code | |||||
HumanEval | 61.0 | 58.5 | 53.1 | 48.8 | 71.4 |
MBPP | 73.4 | 68.6 | 64.2 | 66.6 | 72.6 |
🔼 This table compares the performance of Tencent’s Hunyuan-Large pre-trained model against several other leading large language models (LLMs), including LLaMA3.1-405B, LLaMA3.1-70B, Mixtral-8x22B, and DeepSeek-V2. The comparison encompasses benchmark tasks across English and Chinese, covering language understanding, commonsense reasoning, math, and coding. Key metrics are presented for each model to demonstrate the relative strengths and weaknesses of each LLM on different task types. The table highlights Hunyuan-Large’s performance relative to other models of similar scale, and also lists the total and activated parameter counts and the supported context length for each model.
Table 3: Performance of Hunyuan-Large’s pre-trained model and its competitors.
Model | LLaMA3.1-405B Inst. | LLaMA3.1-70B Inst. | Mixtral-8x22B Inst. | DeepSeek-V2.5 Chat | Hunyuan-Large Inst. |
---|---|---|---|---|---|
MMLU | 87.3 | 83.6 | 77.8 | 80.4 | 89.9 |
CMMLU | - | - | 61.0 | - | 90.4 |
C-Eval | - | - | 60.0 | - | 88.6 |
BBH | - | - | 78.4 | 84.3 | 89.5 |
ARC-C | 96.9 | 94.8 | 90.0 | - | 94.6 |
GPQA_diamond | 51.1 | 46.7 | - | - | 42.4 |
MATH | 73.8 | 68.0 | 49.8 | 74.7 | 77.4 |
HumanEval | 89.0 | 80.5 | 75.0 | 89.0 | 90.0 |
AlignBench | 6.0 | 5.9 | 6.2 | 8.0 | 8.3 |
MT-Bench | 9.1 | 8.8 | 8.1 | 9.0 | 9.4 |
IFEval strict-prompt | 86.0 | 83.6 | 71.2 | - | 85.0 |
Arena-Hard | 69.3 | 55.7 | - | 76.2 | 81.8 |
AlpacaEval-2.0 | 39.3 | 34.3 | 30.9 | 50.5 | 51.8 |
🔼 This table presents a comparison of the performance of Tencent’s Hunyuan-Large-Instruct model against several other leading large language models (LLMs) across a range of benchmark tasks. The benchmarks cover diverse areas such as commonsense reasoning, knowledge-based question answering, mathematical problem-solving, coding ability, instruction following, and alignment with human preferences. The table allows readers to quickly evaluate the relative strengths and weaknesses of Hunyuan-Large-Instruct compared to its competitors, which are the instruction-tuned or chat variants of models such as LLaMA3.1, Mixtral, and DeepSeek, by showing each model’s score on each benchmark task.
Table 4: Performance of our Hunyuan-Large-Instruct and its competitors.
Model | RULER 0-8K | RULER 8K-32K | RULER 32K-64K | RULER 64K-128K | LV-Eval 0-32K | LV-Eval 32K-64K | LV-Eval 64K-128K |
---|---|---|---|---|---|---|---|
LLaMA3.1-70B-Instruct | 95.89 | 95.39 | 94.72 | 86.48 | 75.73 | 62.39 | 61.57 |
Hunyuan-Large-Instruct | 94.39 | 94.94 | 93.02 | 89.53 | 81.92 | 71.15 | 67.87 |
🔼 This table presents the performance comparison of the Hunyuan-Large-Instruct model and the LLaMA3.1-70B-Instruct model on two long-context benchmarks: RULER and LV-Eval. RULER assesses performance across various tasks and context lengths, while LV-Eval focuses on question answering of varying complexity and context length. The table reports scores for RULER at context lengths of 0-8K, 8K-32K, 32K-64K, and 64K-128K tokens, and for LV-Eval at 0-32K, 32K-64K, and 64K-128K tokens, allowing an analysis of how each model’s performance changes as the context length increases.
Table 5: The performance of Hunyuan-Large-Instruct on RULER and LV-Eval.
Model | Information Extraction | Information Localization | Qualitative Analysis | Numerical Reasoning | Overall |
---|---|---|---|---|---|
LLaMA3.1-70B-Instruct | 82.51 | 69.70 | 75.77 | 49.52 | 69.37 |
Hunyuan-Large-Instruct | 91.14 | 89.56 | 92.78 | 67.46 | 85.23 |
🔼 This table presents the performance of the Hunyuan-Large-Instruct model, compared against LLaMA3.1-70B-Instruct, on the PenguinScrolls benchmark. PenguinScrolls is a newly introduced, in-house long-context benchmark designed to evaluate LLMs on real-world, diverse long-form text data. The table shows each model’s performance across four key sub-tasks within the benchmark: Information Extraction, Information Localization, Qualitative Analysis, and Numerical Reasoning, along with an overall score, illustrating the model’s capability to handle extended text inputs across different task types.
Table 6: The performance of Hunyuan-Large-Instruct on PenguinScrolls.