TL;DR#
Low-precision training, especially using floating-point quantization, is crucial for efficient large language model (LLM) training. However, existing scaling laws primarily focus on integer quantization, which isn’t well-suited for the nuances of floating-point methods. This lack of understanding hinders efforts to optimize training costs and predict model performance. This paper addresses these issues by deeply investigating the effects of different floating-point configurations (exponent bits, mantissa bits, scaling factor granularity) on LLM training performance. The research uses a comprehensive experimental setup involving various data and model sizes, along with multiple precision settings, to establish a robust and accurate scaling law for predicting performance under low-precision training. This new scaling law incorporates these crucial floating-point parameters, unlike prior work that treated precision in a less nuanced way.
The core contribution of this work is a novel, unified scaling law that accurately predicts LLM performance across data sizes, model sizes, and floating-point configurations (exponent and mantissa bits). This law allows researchers to select optimal parameter settings efficiently before running costly experiments and assists in predicting model behavior across a wide range of conditions. Key insights from the scaling law include the optimal exponent-mantissa bit ratio for various precision levels and the critical training data size beyond which performance degrades. The research demonstrates that cost-effective performance can be achieved with 4 to 8 bits of precision and proposes guidelines for selecting optimal hardware configurations.
Key Takeaways#
Why does it matter?#
This paper is crucial for researchers working on large language model (LLM) training efficiency and low-precision computation. It provides a novel scaling law for floating-point quantization, offering practical guidance for optimizing LLM training costs and hardware resource allocation. The findings are particularly relevant to current trends in reducing computational expenses and improving LLM deployment on resource-constrained platforms. It opens up new avenues for exploring optimal exponent-mantissa bit ratios and critical data size thresholds in low-precision LLM training.
Visual Insights#
This figure compares the predictions of the scaling law in Eq. (7), derived from Kumar et al. (2024), against actual experimental results for various data sizes (D), exponent bits (E), and mantissa bits (M) during floating-point quantization training. In the three subplots, point sizes are approximately proportional to D, E, and M, respectively. The predictions deviate significantly from the observed results, particularly when both exponent and mantissa bits are small (E1M1), highlighting the limitations of this scaling law for floating-point quantization training.

Figure 1: The fitting results of the scaling law in Eq. (7) derived from Kumar et al. (2024), which has large bias in the E1M1 case. In the left, middle, and right sub-figures, the sizes of the data points are approximately proportional to D, E, and M, respectively.
Hyper-parameters | 41M | 85M | 154M | 679M | 1.2B
---|---|---|---|---|---
Layers | 12 | 12 | 12 | 24 | 24
Hidden Size | 512 | 768 | 1024 | 1536 | 2048
FFN Hidden Size | 1536 | 2048 | 2816 | 4096 | 5632
Attention Heads | 8 | 12 | 16 | 24 | 32
Attention Head Size | 64 | 64 | 64 | 64 | 64
Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW
Adam $(\beta_1,\beta_2)$ | (0.9, 0.95) | (0.9, 0.95) | (0.9, 0.95) | (0.9, 0.95) | (0.9, 0.95)
Adam $\epsilon$ | $1\times 10^{-8}$ | $1\times 10^{-8}$ | $1\times 10^{-8}$ | $1\times 10^{-8}$ | $1\times 10^{-8}$
Weight Decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
Clip Grad Norm | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Max LR | $3.0\times 10^{-4}$ | $3.0\times 10^{-4}$ | $3.0\times 10^{-4}$ | $3.0\times 10^{-4}$ | $3.0\times 10^{-4}$
Min LR | 0 | 0 | 0 | 0 | 0
LR Decay | Cosine | Cosine | Cosine | Cosine | Cosine
Sequence Length | 2048 | 2048 | 2048 | 2048 | 2048
Batch Size (# Tokens) | 2M | 2M | 2M | 2M | 2M
Warmup Steps | 500 | 500 | 500 | 500 | 500
This table details the hyperparameter settings used for training each of the LLaMA model sizes, including the number of layers, hidden size, feed-forward network (FFN) hidden size, attention heads, attention head size, optimizer, Adam (β₁, β₂), Adam ε, weight decay, gradient clipping norm, maximum learning rate, minimum learning rate, learning rate decay schedule, sequence length, batch size (in tokens), and warmup steps. These parameters are crucial for understanding the experimental setup and for reproducibility.
Table 1: Model hyper-parameters for each size.
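For readers who want these settings in machine-readable form, here is the 41M configuration from Table 1 as a plain Python dict. The key names are hypothetical (they do not come from the paper's training framework); the values are copied from the table.

```python
# Table 1, 41M column, as a plain dict. Key names are hypothetical.
llama_41m_config = {
    "layers": 12,
    "hidden_size": 512,
    "ffn_hidden_size": 1536,
    "attention_heads": 8,
    "attention_head_size": 64,
    "optimizer": "AdamW",
    "adam_betas": (0.9, 0.95),
    "adam_eps": 1e-8,
    "weight_decay": 0.1,
    "clip_grad_norm": 1.0,
    "max_lr": 3.0e-4,
    "min_lr": 0.0,
    "lr_decay": "cosine",
    "sequence_length": 2048,
    "batch_size_tokens": 2_000_000,  # "2M" in Table 1; exact token count is an assumption
    "warmup_steps": 500,
}
```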
In-depth insights#
FP Quant. Scaling Laws#
The research explores scaling laws for floating-point quantization training in large language models (LLMs). It challenges existing laws that primarily focus on integer quantization, arguing they don’t accurately capture the nuances of floating-point precision. The study delves into the impact of exponent and mantissa bit allocation on LLM performance, proposing an optimal exponent-mantissa bit ratio for different precision levels. A crucial finding is the identification of a critical data size, beyond which adding more training data hinders performance. The research culminates in a unified scaling law that incorporates data size, model size, exponent, mantissa, and block size of scaling factors, providing a more comprehensive predictive model for low-precision LLM training and guiding cost-effective choices of precision levels for specific computational resources.
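To make the shape of such a unified law concrete, here is a minimal, illustrative Python sketch: a Chinchilla-style base term plus a low-precision penalty that grows with data size and block size and shrinks with model size and with the exponent/mantissa bits. Both the functional form and the constants are placeholders chosen to mirror the qualitative behavior described in this section; the paper's actual equation and fitted values (Table 2) should be used for real predictions.

```python
import numpy as np

# Illustrative only: NOT the paper's exact equation or fit.
def predicted_loss(N, D, E, M, B,
                   n=70.0, alpha=0.24, d=7e4, beta=0.52, eps=1.9,
                   gamma=1e4, delta=3.2, nu=3.0):
    base = n / N**alpha + d / D**beta + eps          # classical N/D scaling terms
    precision_penalty = (gamma
                         * D**beta / N**alpha        # grows with data, shrinks with model size
                         * np.log2(B)                # coarser scaling-factor blocks hurt more
                         / ((E + 0.5)**delta * (M + 0.5)**nu))
    return base + precision_penalty

# Example: a 154M-parameter model trained on 100B tokens in an E4M3-style format
print(predicted_loss(N=154e6, D=100e9, E=4, M=3, B=128))
```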
Optimal Bit Allocation#
Optimal bit allocation in quantized neural networks, especially for large language models (LLMs), is crucial for balancing model accuracy and computational efficiency. Finding the ideal balance between exponent bits (representing the dynamic range) and mantissa bits (representing precision within that range) is key. A common approach is to explore the scaling laws, which describe the relationship between model performance and different hyperparameters, including bit precision. The research investigates how the choice of exponent and mantissa bits affects LLM performance, aiming to find the optimal allocation for a given total number of bits. This involves extensive experimentation, fitting the results to scaling laws, and analyzing the resulting trade-offs. The optimal allocation often varies based on factors like the model size, dataset size, and the chosen quantization method. Further exploration might consider the impact of hardware limitations and the cost-performance trade-offs associated with different bit allocation strategies. Ultimately, the goal is to minimize the loss in accuracy while maximizing computational efficiency, resulting in a cost-effective solution for low-precision training and inference.
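As a concrete illustration of this search, the helper below enumerates every exponent/mantissa split for a fixed total bit width (one bit reserved for sign, an assumption) and keeps the split with the lowest predicted loss. Any loss predictor can be plugged in, for example the illustrative `predicted_loss` sketch above with N, D, and B held fixed.

```python
def best_layout(total_bits, loss_fn):
    # Enumerate (E, M) splits of `total_bits` with 1 sign bit (assumed),
    # and return the split the given predictor scores lowest.
    candidates = [(E, total_bits - 1 - E) for E in range(1, total_bits - 1)]
    return min(candidates, key=lambda em: loss_fn(*em))

# Example with the illustrative predictor above, N/D/B fixed:
# best_layout(8, lambda E, M: predicted_loss(N=154e6, D=100e9, E=E, M=M, B=128))
```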
Critical Data Size#
The concept of “Critical Data Size” in the context of low-precision floating-point quantization training for LLMs reveals a crucial limitation. Beyond a certain data size, increasing training data paradoxically leads to performance degradation instead of improvement. This is attributed to the combined effects of limited precision and the “knowledge intensity” of the model. The model’s capacity to effectively utilize and learn from additional information is overwhelmed by the precision constraints. This highlights that optimal performance is not solely determined by the scale of data, but by a careful balance between data size, model size, and the selected precision. Optimal data size varies significantly depending on the precision level, with higher precisions enabling the use of larger datasets before encountering performance decline. This insight has significant implications for resource allocation and efficient training strategies, emphasizing the importance of precise scaling law estimations that account for the interplay between these factors.
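A simple way to locate this turning point numerically is to sweep the data size on a logarithmic grid and take the minimizer of the predicted loss, as in the sketch below. Here `loss_vs_d` is any predictor of loss as a function of D, for instance the illustrative `predicted_loss` above with N, E, M, and B held fixed.

```python
import numpy as np

def critical_data_size(loss_vs_d, d_grid=None):
    # Return the data size at which the predicted loss bottoms out; beyond this
    # point, more tokens are predicted to hurt under the given precision setting.
    if d_grid is None:
        d_grid = np.logspace(9, 13, 400)  # 1B to 10T tokens
    losses = [loss_vs_d(D) for D in d_grid]
    return d_grid[int(np.argmin(losses))]

# Example with the illustrative predictor above:
# critical_data_size(lambda D: predicted_loss(N=154e6, D=D, E=1, M=2, B=128))
```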
Cost-Optimal Precision#
The concept of “Cost-Optimal Precision” in the context of large language model (LLM) training centers on finding the sweet spot between model accuracy and computational cost. The paper explores the trade-offs between using higher precision (e.g., FP32) for better accuracy and lower precision (e.g., FP8) for reduced computational expenses. It highlights that optimal precision isn’t fixed, but rather dynamically depends on factors like model size, training data volume, and available computational resources. The research likely presents a mathematical framework or scaling laws to predict the best precision for a given set of constraints, enabling researchers and developers to optimize training efficiency without significantly sacrificing model performance. Essentially, the “Cost-Optimal Precision” section aims to guide efficient resource allocation by providing a data-driven method for selecting the most appropriate precision level for LLM training, leading to cost savings and faster training times.
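The sketch below illustrates one way to frame that search. It assumes, purely for illustration, that training compute scales as C ∝ P·N·D, so a lower total bit width P buys proportionally more tokens under a fixed budget; the candidate bit layouts are hypothetical, and the loss predictor can again be the illustrative `predicted_loss` above.

```python
def cost_optimal_precision(budget, N, loss_fn, layouts):
    # layouts maps total bit width P -> (E, M); loss_fn(N, D, E, M) is any predictor.
    best = None
    for P, (E, M) in sorted(layouts.items()):
        D = budget / (P * N)  # tokens affordable at this bit width (assumed cost model)
        loss = loss_fn(N, D, E, M)
        if best is None or loss < best[1]:
            best = (P, loss)
    return best

# Example (hypothetical layouts, block size fixed at 128 as in Figures 17-18):
# cost_optimal_precision(1e20, 679e6,
#                        lambda N, D, E, M: predicted_loss(N, D, E, M, B=128),
#                        layouts={4: (2, 1), 8: (4, 3), 16: (5, 10)})
```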
Future Work#
The authors suggest several avenues for future research. Extending the scaling laws to larger models and datasets is crucial to validate the model’s generalizability and predictive power beyond the current experimental scope. Investigating the applicability of these laws to different LLM architectures, such as those beyond the Transformer architecture, is vital to broaden the findings’ relevance and practical impact. The study focused on specific floating-point quantization strategies, therefore, exploring other quantization methods will enrich the understanding of the impact of precision on LLM performance. Finally, a deeper investigation into the interaction between various quantization techniques and the scaling laws could reveal valuable insights into the optimization of low-precision LLM training and deployment. Addressing these points would further enhance the practical use and theoretical significance of the presented work.
More visual insights#
More on figures
The figure shows the comparison of the Chinchilla scaling law with the actual LLM training losses using BF16 precision. The plot visualizes the alignment of predicted losses against empirical losses for various model sizes, demonstrating the accuracy of the Chinchilla scaling law in predicting LLM performance under BF16 precision.
(a) Chinchilla basic scaling law.
This figure shows the comparison of the OpenAI scaling law with the empirical training loss for various model sizes. The plot illustrates the predicted loss versus the actual training loss observed during experiments, helping assess the accuracy of the OpenAI scaling law in the context of the paper's research.
(b) OpenAI basic scaling law.
This figure compares the performance of two established scaling laws, the Chinchilla scaling law and the OpenAI scaling law, against actual results obtained from LLM training. Both laws attempt to predict training loss (L) based on model size (N) and dataset size (D). The plot visually represents the comparison, showing how well each law predicts the observed training losses, with the size of each data point corresponding to the dataset size (D). This visualization helps assess the accuracy of the classical scaling laws in predicting LLM training behavior and informs the development of a more precise, precision-aware scaling law.

Figure 2: The fitting performance of classical scaling laws. The size of the data point is proportional to D.
This figure illustrates the six different quantization targets considered in the paper: P1 to P6. Each target represents a specific input tensor to the GEMM (General Matrix Multiplication) operations within the Transformer architecture; these GEMMs are involved in the forward and backward passes of the model during training. The paper explores the impact of quantizing each of these tensors individually on the overall model's performance, and the authors ultimately focus on quantizing P2, P4, and P6 in subsequent experiments based on their findings regarding model accuracy.
Figure 3: Quantization Targets. We select P2, P4, and P6 as our quantization targets for the following exploration of scaling laws.
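To make the quantization step itself concrete, below is a minimal NumPy sketch of simulated ExMy floating-point quantization with channel-wise scaling factors, in the spirit of quantizing a GEMM input such as P2, P4, or P6. It is not the paper's implementation: the exponent-range convention and the absence of subnormal handling are simplifying assumptions.

```python
import numpy as np

def fake_fp_quant(x, E=4, M=3, axis=-1):
    # Largest representable magnitude of the assumed ExMy format.
    fp_max = (2 - 2.0 ** (-M)) * 2.0 ** (2 ** (E - 1) - 1)
    # Channel-wise scaling factor: map each channel's max magnitude onto fp_max.
    scale = np.maximum(np.abs(x).max(axis=axis, keepdims=True), 1e-12) / fp_max
    y = x / scale
    # Round each value's mantissa to M bits at its own binary exponent.
    exp = np.floor(np.log2(np.maximum(np.abs(y), 1e-12)))
    step = 2.0 ** (exp - M)
    return np.round(y / step) * step * scale

w = np.random.randn(1024, 1024).astype(np.float32)
print(np.mean(np.abs(fake_fp_quant(w, E=4, M=3) - w)))  # average quantization error
```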
The bar chart visualizes the performance loss differences when applying various quantization strategies to different components of the transformer model (inputs to GEMM computation). It shows that quantizing input embeddings during backward propagation (P5) leads to significant performance degradation, while quantizing other inputs, especially P2, P4, or P6 alone, yields near-optimal results; quantizing multiple targets together may not provide additional benefit. The chart highlights the impact of choosing specific inputs for quantization on model performance.
Figure 4: Results of loss gaps with different quantization targets.
Figure 5 shows the relationship between the hyperparameters γ and ι (from the exponent scaling law, Eq. (12)) and the model size (N) and data size (D). The plots illustrate that γ and ι are not constant values but functions of N and D, indicating that their influence on model performance depends on the model and data size. The size of each data point is proportional to the data size (D).

Figure 5: The correlations between γ, ι in Eq. (12) and N, D. γ, ι could be viewed as functions of N, D. Data point size is proportional to D.
This figure displays the fitting results of the Exponent-related scaling law. The graph shows the relationship between predicted and actual loss values for various LLMs trained under different low-precision settings, with data point sizes proportional to the amount of training data (D). This visualization helps assess how accurately the proposed Exponent-related scaling law models the effect of the exponent bits on LLM performance in floating-point quantized training.

Figure 6: The fitting results of our Exponent-related scaling law. Data point size is proportional to D.
This figure displays the fitting results of the Mantissa-related scaling law, part of the study on scaling laws for floating-point quantization training of LLMs. The plot shows the correlation between predicted and actual loss values for different mantissa configurations. The sizes of the data points are proportional to the size of the training dataset (D), providing a visual representation of the dataset's influence on the Mantissa-related scaling law's accuracy.

Figure 7: The fitting results of our Mantissa-related scaling law. Data point size is proportional to D.
This figure displays the fitting results of the joint Exponent & Mantissa scaling law. It shows how well the predicted loss matches the actual loss across different combinations of exponent bits (E), mantissa bits (M), data size (D), and other parameters. The data point sizes in the three sub-figures are proportional to D, M, and E respectively, allowing a better visualization of their individual impacts on model performance.

Figure 8: The fitting results of the joint Exponent & Mantissa scaling law: Data point sizes in left, middle, and right sub-figures are proportional to D, M, and E, respectively.
Figure 9 visualizes the relationship between the hyperparameters κ and ψ (from the logarithmic scaling law, Eq. (19)) and the model size (N) and dataset size (D). The plots show that κ and ψ exhibit clear correlations with N and D, suggesting that the impact of block size on model performance depends on the model and dataset scales. The size of the data points is proportional to the dataset size (D).

Figure 9: The correlations between κ, ψ in Eq. (19) and N, D. κ, ψ could be viewed as functions of N, D. The data points are scaled proportionally to the value of D.
Figure 10 shows the results of experiments on the impact of block size (B) on the validation loss of LLMs. The proposed scaling law accurately predicts the validation loss for different block sizes (B) and data sizes (D). The left sub-figure shows the correlation between predicted and actual loss for different data sizes, while the right sub-figure emphasizes the relationship between block size (B) and validation loss, showing how accurately the proposed scaling law captures it. In the two sub-figures, data point sizes are proportional to the dataset size (D) and the block size (B), respectively.

Figure 10: Our scaling law precisely forecasts validation loss for diverse block sizes. Data point sizes are directly proportional to D and B in the respective left and right sub-figures.
This figure shows the fitting results of the channel-wise scaling law. The x-axis represents the actual loss and the y-axis the loss predicted by the channel-wise scaling law. Each data point corresponds to a model trained with a particular combination of model size (N), data size (D), exponent bits (E), mantissa bits (M), and block size of scaling factors (B), with point size proportional to the data size (D). The plot demonstrates how well the channel-wise scaling law predicts the loss, helping assess its accuracy and applicability for estimating the performance of low-precision LLM training.

Figure 11: The fitting results of the channel-wise scaling law. The size of the data point is proportional to D.
Figure 12 shows the relationship between the block size of scaling factors (B) and the ratio of model size (N) to data size (D). Specifically, it plots log₂B against N/D, illustrating how the choice of block size relates to the scaling behavior as model and dataset sizes vary. The size of each point corresponds to the dataset size (D), making larger datasets more visually prominent.

Figure 12: The correlations between log₂B and N/D. The size of the data point is proportional to D.
Figure 13 presents the fitting results of the tensor-wise scaling law, which models the relationship between the training loss of an LLM and key parameters such as data size, model size, and block size of scaling factors. The figure displays the agreement between the loss values predicted by the scaling law and the actual losses observed in experiments using the tensor-wise scaling strategy. The size of each data point is proportional to the data size (D), allowing a visual assessment of the accuracy of the tensor-wise scaling law.

Figure 13: The fitting results of the tensor-wise scaling law. The size of the data point is proportional to D.
Figure 14 shows the fitting results of the proposed scaling law for low-precision floating-point training. The plot compares loss values predicted by the scaling law against actual measured losses across a range of training configurations, with each point representing a different setup and point size proportional to the training dataset size (D). The star points show validation on 1.2B-parameter models, which were not used to fit the scaling law.

Figure 14: The fitting results of our scaling law for floating-point quantization training. Data point size is proportional to D. The star points (1.2B models) are our validation.
This figure visualizes the optimal allocation of exponent and mantissa bits for various floating-point precisions (4, 8, and 16 bits). It shows how the optimal bit distribution changes as the total number of bits in the floating-point representation increases. The optimal layout is the one that minimizes the predicted loss under quantization, as derived from the scaling law proposed in the paper.
Figure 15: The optimal float layouts of different bit widths.
This figure shows how the training loss changes with the size of the training dataset under different floating-point quantization configurations. The x-axis represents the dataset size (D) and the y-axis the training loss (L). Each line corresponds to a different combination of exponent (E) and mantissa (M) bits, illustrating how the optimal amount of training data varies with the chosen quantization precision.
Figure 16: Variation of loss with data size under different floating-point quantization settings.
This figure shows the relationship between optimal precision and data size under a fixed computational budget, with block size (B) set to 128. Across a wide range of data sizes (0.1T to 100T tokens), the optimal precision consistently falls between 4 and 8 bits, suggesting that a moderate precision is generally optimal even for very large datasets.

Figure 17: Under the constraint of the compute budget with block size (B) set to 128, and based on the results of our experimental data fitting, the optimal precision (P) values for different data sizes (D) can be deduced. As depicted, across a substantially broad range of data sizes from 0.1T to 100T, the optimal precision value consistently falls within the range of 4 to 8 bits.
Figure 18 shows the relationship between the optimal precision (number of bits used for computation) and the total compute budget. The optimal precision balances accuracy against computational cost: as the budget increases, the optimal precision increases but eventually plateaus, since a larger budget lets the model afford higher precision without significantly sacrificing efficiency. The curve is generated with the block size (B) set to 128 and k = 6/16 in the equation derived by the authors, highlighting that there is a sweet spot between computational cost and precision.

Figure 18: The optimal cost-performance ratio precision as a function of the total compute budget, illustrating the relationship between precision (P) and computational budget (C) when the block size (B) is set to 128 and k = 6/16.
More on tables
Constant | Value
---|---
n | 69.2343
α | 0.2368
d | 68973.0621
β | 0.5162
ϵ | 1.9061
γ | 11334.5197
δ | 3.1926
ν | 2.9543
This table presents the fitted hyperparameters and their corresponding values used in the proposed unified scaling law for floating-point quantization training. These constants (n, α, d, β, ϵ, γ, δ, and ν) quantify how model size, data size, exponent bits, mantissa bits, and block size jointly affect the performance of LLMs during low-precision training.
Table 2: Fitted hyper-parameters and their values in our proposed unified scaling law for floating-point quantization training.
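For convenience, the fitted values from Table 2 can be collected in a small dict for programmatic use; how each constant enters the unified scaling law is defined by the paper's equation, so only the values themselves are reproduced here.

```python
# Fitted constants from Table 2 of the paper.
FITTED_CONSTANTS = {
    "n": 69.2343,
    "alpha": 0.2368,
    "d": 68973.0621,
    "beta": 0.5162,
    "epsilon": 1.9061,
    "gamma": 11334.5197,
    "delta": 3.1926,
    "nu": 2.9543,
}
```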
# | N | D | E | M | B | Fitting support
---|---|---|---|---|---|---
0 | 40894464 | 10485760000 | 0 | 7 | channel | β |
1 | 40894464 | 10485760000 | 1 | 1 | 32 | β |
2 | 40894464 | 10485760000 | 1 | 1 | 64 | β |
3 | 40894464 | 10485760000 | 1 | 1 | 128 | β |
4 | 40894464 | 10485760000 | 1 | 1 | 256 | β |
5 | 40894464 | 10485760000 | 1 | 1 | 512 | β |
6 | 40894464 | 10485760000 | 1 | 1 | channel | β |
7 | 40894464 | 10485760000 | 1 | 1 | tensor | β |
8 | 40894464 | 10485760000 | 1 | 2 | channel | β |
9 | 40894464 | 10485760000 | 1 | 3 | channel | β |
10 | 40894464 | 10485760000 | 1 | 4 | channel | β |
11 | 40894464 | 10485760000 | 1 | 5 | channel | β |
12 | 40894464 | 10485760000 | 1 | 6 | channel | β |
13 | 40894464 | 10485760000 | 2 | 1 | channel | β |
14 | 40894464 | 10485760000 | 2 | 3 | channel | β |
15 | 40894464 | 10485760000 | 3 | 1 | channel | β |
16 | 40894464 | 10485760000 | 3 | 2 | channel | β |
17 | 40894464 | 10485760000 | 4 | 1 | channel | β |
18 | 40894464 | 10485760000 | 4 | 3 | channel | β |
19 | 40894464 | 10485760000 | 4 | 5 | channel | β |
20 | 40894464 | 10485760000 | 5 | 1 | channel | β |
21 | 40894464 | 10485760000 | 5 | 2 | channel | β |
22 | 40894464 | 10485760000 | 6 | 1 | channel | β |
23 | 40894464 | 20971520000 | 0 | 7 | channel | β |
24 | 40894464 | 20971520000 | 1 | 1 | 32 | β |
25 | 40894464 | 20971520000 | 1 | 1 | 64 | β |
26 | 40894464 | 20971520000 | 1 | 1 | 128 | β |
27 | 40894464 | 20971520000 | 1 | 1 | 256 | β |
28 | 40894464 | 20971520000 | 1 | 1 | 512 | β |
29 | 40894464 | 20971520000 | 1 | 1 | channel | β |
30 | 40894464 | 20971520000 | 1 | 1 | tensor | β |
31 | 40894464 | 20971520000 | 1 | 2 | channel | β |
32 | 40894464 | 20971520000 | 1 | 3 | channel | β |
33 | 40894464 | 20971520000 | 1 | 4 | channel | β |
34 | 40894464 | 20971520000 | 1 | 5 | channel | β |
35 | 40894464 | 20971520000 | 1 | 6 | channel | β |
36 | 40894464 | 20971520000 | 2 | 1 | channel | β |
37 | 40894464 | 20971520000 | 2 | 3 | channel | β |
38 | 40894464 | 20971520000 | 3 | 1 | channel | β |
39 | 40894464 | 20971520000 | 3 | 2 | channel | β |
40 | 40894464 | 20971520000 | 4 | 1 | channel | β |
41 | 40894464 | 20971520000 | 4 | 3 | channel | β |
42 | 40894464 | 20971520000 | 4 | 5 | channel | β |
43 | 40894464 | 20971520000 | 5 | 1 | channel | β |
44 | 40894464 | 20971520000 | 5 | 2 | channel | β |
45 | 40894464 | 20971520000 | 6 | 1 | channel | β |
46 | 40894464 | 52428800000 | 0 | 7 | channel | β |
47 | 40894464 | 52428800000 | 1 | 1 | 32 | β |
48 | 40894464 | 52428800000 | 1 | 1 | 64 | β |
49 | 40894464 | 52428800000 | 1 | 1 | 128 | β |
50 | 40894464 | 52428800000 | 1 | 1 | 256 | β |
51 | 40894464 | 52428800000 | 1 | 1 | 512 | β |
52 | 40894464 | 52428800000 | 1 | 1 | channel | β |
53 | 40894464 | 52428800000 | 1 | 1 | tensor | β |
54 | 40894464 | 52428800000 | 1 | 2 | channel | β |
55 | 40894464 | 52428800000 | 1 | 3 | channel | β |
56 | 40894464 | 52428800000 | 1 | 4 | channel | β |
57 | 40894464 | 52428800000 | 1 | 5 | channel | β |
58 | 40894464 | 52428800000 | 1 | 6 | channel | β |
59 | 40894464 | 52428800000 | 2 | 1 | channel | β |
60 | 40894464 | 52428800000 | 2 | 3 | channel | β |
61 | 40894464 | 52428800000 | 3 | 1 | channel | β |
62 | 40894464 | 52428800000 | 3 | 2 | channel | β |
63 | 40894464 | 52428800000 | 4 | 1 | channel | β |
64 | 40894464 | 52428800000 | 4 | 3 | channel | β |
65 | 40894464 | 52428800000 | 4 | 5 | channel | β |
66 | 40894464 | 52428800000 | 5 | 1 | channel | β |
67 | 40894464 | 52428800000 | 5 | 2 | channel | β |
68 | 40894464 | 52428800000 | 6 | 1 | channel | β |
69 | 40894464 | 104857600000 | 0 | 7 | channel | β |
70 | 40894464 | 104857600000 | 1 | 1 | 32 | β |
71 | 40894464 | 104857600000 | 1 | 1 | 64 | β |
72 | 40894464 | 104857600000 | 1 | 1 | 128 | β |
73 | 40894464 | 104857600000 | 1 | 1 | 256 | β |
74 | 40894464 | 104857600000 | 1 | 1 | 512 | β |
75 | 40894464 | 104857600000 | 1 | 1 | channel | β |
76 | 40894464 | 104857600000 | 1 | 1 | tensor | β |
77 | 40894464 | 104857600000 | 1 | 2 | channel | β |
78 | 40894464 | 104857600000 | 1 | 3 | channel | β |
79 | 40894464 | 104857600000 | 1 | 4 | channel | β |
80 | 40894464 | 104857600000 | 1 | 5 | channel | β |
81 | 40894464 | 104857600000 | 1 | 6 | channel | β |
82 | 40894464 | 104857600000 | 2 | 1 | channel | β |
83 | 40894464 | 104857600000 | 2 | 3 | channel | β |
84 | 40894464 | 104857600000 | 3 | 1 | channel | β |
85 | 40894464 | 104857600000 | 3 | 2 | channel | β |
86 | 40894464 | 104857600000 | 4 | 1 | channel | β |
87 | 40894464 | 104857600000 | 4 | 3 | channel | β |
88 | 40894464 | 104857600000 | 4 | 5 | channel | β |
89 | 40894464 | 104857600000 | 5 | 1 | channel | β |
90 | 40894464 | 104857600000 | 5 | 2 | channel | β |
91 | 40894464 | 104857600000 | 6 | 1 | channel | β |
92 | 84934656 | 10485760000 | 0 | 7 | channel | β |
93 | 84934656 | 10485760000 | 1 | 1 | 32 | β |
94 | 84934656 | 10485760000 | 1 | 1 | 64 | β |
95 | 84934656 | 10485760000 | 1 | 1 | 128 | β |
96 | 84934656 | 10485760000 | 1 | 1 | 256 | β |
97 | 84934656 | 10485760000 | 1 | 1 | channel | β |
98 | 84934656 | 10485760000 | 1 | 1 | tensor | β |
99 | 84934656 | 10485760000 | 1 | 2 | channel | β |
100 | 84934656 | 10485760000 | 1 | 3 | channel | β |
101 | 84934656 | 10485760000 | 1 | 4 | channel | β |
102 | 84934656 | 10485760000 | 1 | 5 | channel | β |
103 | 84934656 | 10485760000 | 1 | 6 | channel | β |
104 | 84934656 | 10485760000 | 2 | 1 | channel | β |
105 | 84934656 | 10485760000 | 2 | 3 | channel | β |
106 | 84934656 | 10485760000 | 3 | 1 | channel | β |
107 | 84934656 | 10485760000 | 3 | 2 | channel | β |
108 | 84934656 | 10485760000 | 4 | 1 | channel | β |
109 | 84934656 | 10485760000 | 4 | 3 | channel | β |
110 | 84934656 | 10485760000 | 4 | 5 | channel | β |
111 | 84934656 | 10485760000 | 5 | 1 | channel | β |
112 | 84934656 | 10485760000 | 5 | 2 | channel | β |
113 | 84934656 | 10485760000 | 6 | 1 | channel | β |
114 | 84934656 | 20971520000 | 0 | 7 | channel | β |
115 | 84934656 | 20971520000 | 1 | 1 | 32 | β |
116 | 84934656 | 20971520000 | 1 | 1 | 64 | β |
117 | 84934656 | 20971520000 | 1 | 1 | 128 | β |
118 | 84934656 | 20971520000 | 1 | 1 | 256 | β |
119 | 84934656 | 20971520000 | 1 | 1 | channel | β |
120 | 84934656 | 20971520000 | 1 | 1 | tensor | β |
121 | 84934656 | 20971520000 | 1 | 2 | channel | β |
122 | 84934656 | 20971520000 | 1 | 3 | channel | β |
123 | 84934656 | 20971520000 | 1 | 4 | channel | β |
124 | 84934656 | 20971520000 | 1 | 5 | channel | β |
125 | 84934656 | 20971520000 | 1 | 6 | channel | β |
126 | 84934656 | 20971520000 | 2 | 1 | channel | β |
127 | 84934656 | 20971520000 | 2 | 3 | channel | β |
128 | 84934656 | 20971520000 | 3 | 1 | channel | β |
129 | 84934656 | 20971520000 | 3 | 2 | channel | β |
130 | 84934656 | 20971520000 | 4 | 1 | channel | β |
131 | 84934656 | 20971520000 | 4 | 3 | channel | β |
132 | 84934656 | 20971520000 | 4 | 5 | channel | β |
133 | 84934656 | 20971520000 | 5 | 1 | channel | β |
134 | 84934656 | 20971520000 | 5 | 2 | channel | β |
135 | 84934656 | 20971520000 | 6 | 1 | channel | β |
136 | 84934656 | 52428800000 | 0 | 7 | channel | β |
137 | 84934656 | 52428800000 | 1 | 1 | 32 | β |
138 | 84934656 | 52428800000 | 1 | 1 | 64 | β |
139 | 84934656 | 52428800000 | 1 | 1 | 128 | β |
140 | 84934656 | 52428800000 | 1 | 1 | 256 | β |
141 | 84934656 | 52428800000 | 1 | 1 | channel | β |
142 | 84934656 | 52428800000 | 1 | 1 | tensor | β |
143 | 84934656 | 52428800000 | 1 | 2 | channel | β |
144 | 84934656 | 52428800000 | 1 | 3 | channel | β |
145 | 84934656 | 52428800000 | 1 | 4 | channel | β |
146 | 84934656 | 52428800000 | 1 | 5 | channel | β |
147 | 84934656 | 52428800000 | 1 | 6 | channel | β |
148 | 84934656 | 52428800000 | 2 | 1 | channel | β |
149 | 84934656 | 52428800000 | 2 | 3 | channel | β |
150 | 84934656 | 52428800000 | 3 | 1 | channel | β |
151 | 84934656 | 52428800000 | 3 | 2 | channel | β |
152 | 84934656 | 52428800000 | 4 | 1 | channel | β |
153 | 84934656 | 52428800000 | 4 | 3 | channel | β |
154 | 84934656 | 52428800000 | 4 | 5 | channel | β |
155 | 84934656 | 52428800000 | 5 | 1 | channel | β |
156 | 84934656 | 52428800000 | 5 | 2 | channel | β |
157 | 84934656 | 52428800000 | 6 | 1 | channel | β |
158 | 84934656 | 104857600000 | 0 | 7 | channel | β |
159 | 84934656 | 104857600000 | 1 | 1 | 32 | β |
160 | 84934656 | 104857600000 | 1 | 1 | 64 | β |
161 | 84934656 | 104857600000 | 1 | 1 | 128 | β |
162 | 84934656 | 104857600000 | 1 | 1 | 256 | β |
163 | 84934656 | 104857600000 | 1 | 1 | channel | β |
164 | 84934656 | 104857600000 | 1 | 1 | tensor | β |
165 | 84934656 | 104857600000 | 1 | 2 | channel | β |
166 | 84934656 | 104857600000 | 1 | 3 | channel | β |
167 | 84934656 | 104857600000 | 1 | 4 | channel | β |
168 | 84934656 | 104857600000 | 1 | 5 | channel | β |
169 | 84934656 | 104857600000 | 1 | 6 | channel | β |
170 | 84934656 | 104857600000 | 2 | 1 | channel | β |
171 | 84934656 | 104857600000 | 2 | 3 | channel | β |
172 | 84934656 | 104857600000 | 3 | 1 | channel | β |
173 | 84934656 | 104857600000 | 3 | 2 | channel | β |
174 | 84934656 | 104857600000 | 4 | 1 | channel | β |
175 | 84934656 | 104857600000 | 4 | 3 | channel | β |
176 | 84934656 | 104857600000 | 4 | 5 | channel | β |
177 | 84934656 | 104857600000 | 5 | 1 | channel | β |
178 | 84934656 | 104857600000 | 5 | 2 | channel | β |
179 | 84934656 | 104857600000 | 6 | 1 | channel | β |
180 | 154140672 | 10485760000 | 0 | 7 | channel | β |
181 | 154140672 | 10485760000 | 1 | 1 | 32 | β |
182 | 154140672 | 10485760000 | 1 | 1 | 64 | β |
183 | 154140672 | 10485760000 | 1 | 1 | 128 | β |
184 | 154140672 | 10485760000 | 1 | 1 | 256 | β |
185 | 154140672 | 10485760000 | 1 | 1 | channel | β |
186 | 154140672 | 10485760000 | 1 | 1 | tensor | β |
187 | 154140672 | 10485760000 | 1 | 2 | channel | β |
188 | 154140672 | 10485760000 | 1 | 3 | channel | β |
189 | 154140672 | 10485760000 | 1 | 4 | channel | β |
190 | 154140672 | 10485760000 | 1 | 5 | channel | β |
191 | 154140672 | 10485760000 | 1 | 6 | channel | β |
192 | 154140672 | 10485760000 | 2 | 1 | channel | β |
193 | 154140672 | 10485760000 | 2 | 3 | channel | β |
194 | 154140672 | 10485760000 | 3 | 1 | channel | β |
195 | 154140672 | 10485760000 | 3 | 2 | channel | β |
196 | 154140672 | 10485760000 | 4 | 1 | channel | β |
197 | 154140672 | 10485760000 | 4 | 3 | channel | β |
198 | 154140672 | 10485760000 | 4 | 5 | channel | β |
199 | 154140672 | 10485760000 | 5 | 1 | channel | β |
200 | 154140672 | 10485760000 | 5 | 2 | channel | β |
201 | 154140672 | 10485760000 | 6 | 1 | channel | β |
202 | 154140672 | 20971520000 | 0 | 7 | channel | β |
203 | 154140672 | 20971520000 | 1 | 1 | 32 | β |
204 | 154140672 | 20971520000 | 1 | 1 | 64 | β |
205 | 154140672 | 20971520000 | 1 | 1 | 128 | β |
206 | 154140672 | 20971520000 | 1 | 1 | 256 | β |
207 | 154140672 | 20971520000 | 1 | 1 | channel | β |
208 | 154140672 | 20971520000 | 1 | 1 | tensor | β |
209 | 154140672 | 20971520000 | 1 | 2 | channel | β |
210 | 154140672 | 20971520000 | 1 | 3 | channel | β |
211 | 154140672 | 20971520000 | 1 | 4 | channel | β |
212 | 154140672 | 20971520000 | 1 | 5 | channel | β |
213 | 154140672 | 20971520000 | 1 | 6 | channel | β |
214 | 154140672 | 20971520000 | 2 | 1 | channel | β |
215 | 154140672 | 20971520000 | 2 | 3 | channel | β |
216 | 154140672 | 20971520000 | 3 | 1 | channel | β |
217 | 154140672 | 20971520000 | 3 | 2 | channel | β |
218 | 154140672 | 20971520000 | 4 | 1 | channel | β |
219 | 154140672 | 20971520000 | 4 | 3 | channel | β |
220 | 154140672 | 20971520000 | 4 | 5 | channel | β |
221 | 154140672 | 20971520000 | 5 | 1 | channel | β |
222 | 154140672 | 20971520000 | 5 | 2 | channel | β |
223 | 154140672 | 20971520000 | 6 | 1 | channel | β |
224 | 154140672 | 52428800000 | 0 | 7 | channel | β |
225 | 154140672 | 52428800000 | 1 | 1 | 32 | β |
226 | 154140672 | 52428800000 | 1 | 1 | 64 | β |
227 | 154140672 | 52428800000 | 1 | 1 | 128 | β |
228 | 154140672 | 52428800000 | 1 | 1 | 256 | β |
229 | 154140672 | 52428800000 | 1 | 1 | channel | β |
230 | 154140672 | 52428800000 | 1 | 1 | tensor | β |
231 | 154140672 | 52428800000 | 1 | 2 | channel | β |
232 | 154140672 | 52428800000 | 1 | 3 | channel | β |
233 | 154140672 | 52428800000 | 1 | 4 | channel | β |
234 | 154140672 | 52428800000 | 1 | 5 | channel | β |
235 | 154140672 | 52428800000 | 1 | 6 | channel | β |
236 | 154140672 | 52428800000 | 2 | 1 | channel | β |
237 | 154140672 | 52428800000 | 2 | 3 | channel | β |
238 | 154140672 | 52428800000 | 3 | 1 | channel | β |
239 | 154140672 | 52428800000 | 3 | 2 | channel | β |
240 | 154140672 | 52428800000 | 4 | 1 | channel | β |
241 | 154140672 | 52428800000 | 4 | 3 | channel | β |
242 | 154140672 | 52428800000 | 4 | 5 | channel | β |
243 | 154140672 | 52428800000 | 5 | 1 | channel | β |
244 | 154140672 | 52428800000 | 5 | 2 | channel | β |
245 | 154140672 | 52428800000 | 6 | 1 | channel | β |
246 | 154140672 | 104857600000 | 0 | 7 | channel | β |
247 | 154140672 | 104857600000 | 1 | 1 | 32 | β |
248 | 154140672 | 104857600000 | 1 | 1 | 64 | β |
249 | 154140672 | 104857600000 | 1 | 1 | 128 | β |
250 | 154140672 | 104857600000 | 1 | 1 | 256 | β |
251 | 154140672 | 104857600000 | 1 | 1 | channel | β |
252 | 154140672 | 104857600000 | 1 | 1 | tensor | β |
253 | 154140672 | 104857600000 | 1 | 2 | channel | β |
254 | 154140672 | 104857600000 | 1 | 3 | channel | β |
255 | 154140672 | 104857600000 | 1 | 4 | channel | β |
256 | 154140672 | 104857600000 | 1 | 5 | channel | β |
257 | 154140672 | 104857600000 | 1 | 6 | channel | β |
258 | 154140672 | 104857600000 | 2 | 1 | channel | β |
259 | 154140672 | 104857600000 | 2 | 3 | channel | β |
260 | 154140672 | 104857600000 | 3 | 1 | channel | β |
261 | 154140672 | 104857600000 | 3 | 2 | channel | β |
262 | 154140672 | 104857600000 | 4 | 1 | channel | β |
263 | 154140672 | 104857600000 | 4 | 3 | channel | β |
264 | 154140672 | 104857600000 | 4 | 5 | channel | β |
265 | 154140672 | 104857600000 | 5 | 1 | channel | β |
266 | 154140672 | 104857600000 | 5 | 2 | channel | β |
267 | 154140672 | 104857600000 | 6 | 1 | channel | β |
268 | 679477248 | 10485760000 | 0 | 7 | channel | β |
269 | 679477248 | 10485760000 | 1 | 1 | 32 | β |
270 | 679477248 | 10485760000 | 1 | 1 | 64 | β |
271 | 679477248 | 10485760000 | 1 | 1 | 128 | β |
272 | 679477248 | 10485760000 | 1 | 1 | 256 | β |
273 | 679477248 | 10485760000 | 1 | 1 | 512 | β |
274 | 679477248 | 10485760000 | 1 | 1 | channel | β |
275 | 679477248 | 10485760000 | 1 | 1 | tensor | β |
276 | 679477248 | 10485760000 | 1 | 2 | channel | β |
277 | 679477248 | 10485760000 | 1 | 3 | channel | β |
278 | 679477248 | 10485760000 | 1 | 4 | channel | β |
279 | 679477248 | 10485760000 | 1 | 5 | channel | β |
280 | 679477248 | 10485760000 | 1 | 6 | channel | β |
281 | 679477248 | 10485760000 | 2 | 1 | channel | β |
282 | 679477248 | 10485760000 | 2 | 3 | channel | β |
283 | 679477248 | 10485760000 | 3 | 1 | channel | β |
284 | 679477248 | 10485760000 | 3 | 2 | channel | β |
285 | 679477248 | 10485760000 | 4 | 1 | channel | β |
286 | 679477248 | 10485760000 | 4 | 3 | channel | β |
287 | 679477248 | 10485760000 | 4 | 5 | channel | β |
288 | 679477248 | 10485760000 | 5 | 1 | channel | β |
289 | 679477248 | 10485760000 | 5 | 2 | channel | β |
290 | 679477248 | 10485760000 | 6 | 1 | channel | β |
291 | 679477248 | 20971520000 | 0 | 7 | channel | β |
292 | 679477248 | 20971520000 | 1 | 1 | 32 | β |
293 | 679477248 | 20971520000 | 1 | 1 | 64 | β |
294 | 679477248 | 20971520000 | 1 | 1 | 128 | β |
295 | 679477248 | 20971520000 | 1 | 1 | 256 | β |
296 | 679477248 | 20971520000 | 1 | 1 | 512 | β |
297 | 679477248 | 20971520000 | 1 | 1 | channel | β |
298 | 679477248 | 20971520000 | 1 | 1 | tensor | β |
299 | 679477248 | 20971520000 | 1 | 2 | channel | β |
300 | 679477248 | 20971520000 | 1 | 3 | channel | β |
301 | 679477248 | 20971520000 | 1 | 4 | channel | β |
302 | 679477248 | 20971520000 | 1 | 5 | channel | β |
303 | 679477248 | 20971520000 | 1 | 6 | channel | β |
304 | 679477248 | 20971520000 | 2 | 1 | channel | β |
305 | 679477248 | 20971520000 | 2 | 3 | channel | β |
306 | 679477248 | 20971520000 | 3 | 1 | channel | β |
307 | 679477248 | 20971520000 | 3 | 2 | channel | β |
308 | 679477248 | 20971520000 | 4 | 1 | channel | β |
309 | 679477248 | 20971520000 | 4 | 3 | channel | β |
310 | 679477248 | 20971520000 | 4 | 5 | channel | β |
311 | 679477248 | 20971520000 | 5 | 1 | channel | β |
312 | 679477248 | 20971520000 | 5 | 2 | channel | β |
313 | 679477248 | 20971520000 | 6 | 1 | channel | β |
314 | 679477248 | 52428800000 | 0 | 7 | channel | β |
315 | 679477248 | 52428800000 | 1 | 1 | 32 | β |
316 | 679477248 | 52428800000 | 1 | 1 | 64 | β |
317 | 679477248 | 52428800000 | 1 | 1 | 128 | β |
318 | 679477248 | 52428800000 | 1 | 1 | 256 | β |
319 | 679477248 | 52428800000 | 1 | 1 | 512 | β |
320 | 679477248 | 52428800000 | 1 | 1 | channel | β |
321 | 679477248 | 52428800000 | 1 | 1 | tensor | β |
322 | 679477248 | 52428800000 | 1 | 2 | channel | β |
323 | 679477248 | 52428800000 | 1 | 3 | channel | β |
324 | 679477248 | 52428800000 | 1 | 4 | channel | β |
325 | 679477248 | 52428800000 | 1 | 5 | channel | β |
326 | 679477248 | 52428800000 | 1 | 6 | channel | β |
327 | 679477248 | 52428800000 | 2 | 1 | channel | β |
328 | 679477248 | 52428800000 | 2 | 3 | channel | β |
329 | 679477248 | 52428800000 | 3 | 1 | channel | β |
330 | 679477248 | 52428800000 | 3 | 2 | channel | β |
331 | 679477248 | 52428800000 | 4 | 1 | channel | β |
332 | 679477248 | 52428800000 | 4 | 3 | channel | β |
333 | 679477248 | 52428800000 | 4 | 5 | channel | β |
334 | 679477248 | 52428800000 | 5 | 1 | channel | β |
335 | 679477248 | 52428800000 | 5 | 2 | channel | β |
336 | 679477248 | 52428800000 | 6 | 1 | channel | β |
337 | 679477248 | 104857600000 | 0 | 7 | channel | β |
338 | 679477248 | 104857600000 | 1 | 1 | 32 | β |
339 | 679477248 | 104857600000 | 1 | 1 | 64 | β |
340 | 679477248 | 104857600000 | 1 | 1 | 128 | β |
341 | 679477248 | 104857600000 | 1 | 1 | 256 | β |
342 | 679477248 | 104857600000 | 1 | 1 | 512 | β |
343 | 679477248 | 104857600000 | 1 | 1 | channel | β |
344 | 679477248 | 104857600000 | 1 | 1 | tensor | β |
345 | 679477248 | 104857600000 | 1 | 2 | channel | β |
346 | 679477248 | 104857600000 | 1 | 3 | channel | β |
347 | 679477248 | 104857600000 | 1 | 4 | channel | β |
348 | 679477248 | 104857600000 | 1 | 5 | channel | β |
349 | 679477248 | 104857600000 | 1 | 6 | channel | β |
350 | 679477248 | 104857600000 | 2 | 1 | channel | β |
351 | 679477248 | 104857600000 | 2 | 3 | channel | β |
352 | 679477248 | 104857600000 | 3 | 1 | channel | β |
353 | 679477248 | 104857600000 | 3 | 2 | channel | β |
354 | 679477248 | 104857600000 | 4 | 1 | channel | β |
355 | 679477248 | 104857600000 | 4 | 3 | channel | β |
356 | 679477248 | 104857600000 | 4 | 5 | channel | β |
357 | 679477248 | 104857600000 | 5 | 2 | channel | β |
358 | 679477248 | 104857600000 | 6 | 1 | channel | β |
359 | 1233125376 | 10485760000 | 1 | 2 | 512 | β |
360 | 1233125376 | 10485760000 | 4 | 3 | 512 | β |
361 | 1233125376 | 20971520000 | 1 | 2 | 512 | β |
362 | 1233125376 | 20971520000 | 4 | 3 | 512 | β |
363 | 1233125376 | 52428800000 | 1 | 2 | 512 | β |
364 | 1233125376 | 52428800000 | 4 | 3 | 512 | β |
365 | 1233125376 | 104857600000 | 1 | 2 | 512 | β |
366 | 1233125376 | 104857600000 | 4 | 3 | 512 | β |
This table details the configurations used in the ablation experiments. Each row represents a unique experiment, specifying the model size (N), dataset size (D), exponent bits (E), mantissa bits (M), and block size (B). The "Fitting support" column indicates whether that experiment's results were used for fitting the scaling laws presented in the paper. These ablation studies systematically investigate the impact of the floating-point quantization parameters on model performance and on the accuracy of the scaling laws.
Table 3: All configurations for the ablation experiments.