TL;DR#
Training large language models (LLMs) efficiently requires overcoming significant memory and communication challenges in distributed training. Existing methods, like ZeRO, offer limited options and often struggle with heterogeneous network setups where intra-group communication outperforms inter-group communication. This paper aims to address these limitations by proposing improved strategies for LLM training.
The authors introduce PaRO (Partial Redundancy Optimizer), a novel framework that offers refined model state partitioning (PaRO-DP) and tailored collective communication (PaRO-CC). PaRO-DP provides more trade-off options between memory and communication costs compared to existing strategies, improving training speed. PaRO-CC optimizes communication by leveraging intra- and inter-group performance differences. Experiments show that PaRO improves training speed significantly, up to 266% faster than ZeRO-3. The paper provides a guideline for selecting the best PaRO strategy, ensuring minimal performance loss.
Key Takeaways#
Why does it matter?#
This paper is crucial for researchers working on large language model (LLM) training because it introduces novel strategies for optimizing data parallelism. The PaRO framework significantly improves training speed, offering practical solutions to address memory and communication bottlenecks common in distributed LLM training. The findings open new avenues for optimizing existing and future LLM training frameworks, especially for those using heterogeneous network architectures.
Visual Insights#
This figure shows two schematics illustrating the PaRO-DP (Partial Redundancy Optimizer-Data Parallelism) training strategy using four GPUs divided into two groups. Schematic (a) depicts the PIIG strategy (intra-group parameter partitioning, intra-group gradient partitioning, global optimizer state partitioning). Schematic (b) shows the PNIG strategy (no parameter partitioning, intra-group gradient partitioning, global optimizer state partitioning). Both diagrams detail the forward pass, backward pass, gradient accumulation and update steps, highlighting intra-group and inter-group communication operations with specific prefixes. The figure demonstrates the proposed PaRO-DP's ability to optimize memory and communication costs by strategically partitioning model states.
This table systematically lists all 27 possible combinations of partitioning strategies for model parameters (p), gradients (g), and optimizer states (os) across three levels of granularity: no partitioning (N), intra-group partitioning (I), and global partitioning (G). Each combination is represented as P_{p+g+os}, where p, g, and os denote the partitioning granularities chosen for parameters, gradients, and optimizer states, respectively. A checkmark indicates effective combinations identified by the authors, while a cross indicates strategies eliminated based on the authors' analysis of the trade-off between memory and communication costs.
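For concreteness, the 27 combinations are simply every choice of one granularity per model state; a minimal sketch (plain Python, not tied to the authors' code) that enumerates them:

```python
from itertools import product

# Three partitioning granularities for each model state:
# N = no partitioning, I = intra-group partitioning, G = global partitioning.
GRANULARITIES = ("N", "I", "G")

# Enumerate every P_{p+g+os} combination for (parameters, gradients, optimizer states).
combinations = ["".join(c) for c in product(GRANULARITIES, repeat=3)]

print(len(combinations))   # 27
print(combinations[:5])    # ['NNN', 'NNI', 'NNG', 'NIN', 'NII']
```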
In-depth insights#
PaRO: DP Strategy#
The PaRO (Partial Redundancy Optimizer) data parallelism (DP) strategy offers a novel approach to training large language models (LLMs) by rethinking the trade-off between memory and communication costs. Unlike existing methods that rely on a single partitioning strategy, PaRO introduces a refined model state partitioning, providing multiple options to balance these costs based on specific hardware configurations and LLM sizes. This flexibility is crucial, especially in heterogeneous environments where intra- and inter-group communication speeds differ significantly. By carefully managing redundancy, PaRO-DP aims to minimize inter-group communication, thus accelerating the overall training process. The core innovation lies in the tailored training procedures, which adapt to the chosen partitioning strategy, ensuring efficient memory utilization while avoiding redundant communication. A key aspect is the guideline for selecting the optimal strategy, based on quantitative calculations, thus eliminating the need for extensive experimentation and potentially reducing training time. This methodical approach, combined with the potential to enhance efficiency for partial parameter training and PEFT methods, makes PaRO-DP a significant advancement in LLM training optimization.
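The selection guideline is quantitative: estimate each candidate strategy's per-step cost from the cluster's bandwidths and model size, discard strategies that would not fit in GPU memory, and pick the fastest of the remainder. The sketch below illustrates only that selection loop; `estimate_step_time` and `estimate_memory` are hypothetical placeholders standing in for the paper's analytical cost formulas, not the authors' implementation.

```python
def select_paro_strategy(candidates, model, cluster, gpu_memory_gb,
                         estimate_step_time, estimate_memory):
    """Pick the candidate strategy (e.g. 'NIG', 'IIG', 'GGG') with the lowest
    estimated step time that still fits in per-GPU memory.

    `estimate_step_time` and `estimate_memory` are placeholders for the
    paper's analytical cost model (the t_param, t_gradient, ... tables).
    """
    best, best_time = None, float("inf")
    for strategy in candidates:
        mem_gb = estimate_memory(strategy, model, cluster)   # projected GB per GPU
        if mem_gb > gpu_memory_gb:                           # would run out of memory
            continue
        t = estimate_step_time(strategy, model, cluster)     # projected seconds per step
        if t < best_time:
            best, best_time = strategy, t
    return best
```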
Communication Gaps#
Analyzing communication inefficiencies in large language model (LLM) training reveals significant performance bottlenecks. Heterogeneous network architectures, common in large-scale GPU clusters, introduce disparities between intra-group (high-speed) and inter-group (lower-speed) communication. This disparity significantly impacts collective communication operations (all-reduce, all-gather, etc.) frequently employed in data parallel training strategies. Strategies like ZeRO, while effective at reducing memory costs, often exacerbate these communication gaps due to their reliance on frequent global synchronization steps. Addressing these gaps requires innovative approaches beyond traditional techniques. This necessitates a shift toward refined model state partitioning strategies to minimize unnecessary inter-group communication, and possibly, clever network topology restructuring to fully exploit intra-group bandwidth. Techniques that can minimize global communication while maximizing local processing are key to achieving substantial improvements in LLM training efficiency.
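The size of this gap is straightforward to quantify on a given cluster. The micro-benchmark below is a rough sketch, assuming a torchrun launch with one process per GPU and an equal number of GPUs per node; it times an all-reduce inside one node versus across nodes using standard torch.distributed process groups.

```python
import os
import time

import torch
import torch.distributed as dist

def timed_all_reduce(group, numel=64 * 1024 * 1024, iters=10):
    """Average all-reduce latency (seconds) for an fp16 tensor over `group`."""
    t = torch.ones(numel, dtype=torch.float16, device="cuda")
    dist.all_reduce(t, group=group)              # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(t, group=group)
    torch.cuda.synchronize()
    return (time.time() - start) / iters

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank, world = dist.get_rank(), dist.get_world_size()
    per_node = torch.cuda.device_count()
    n_nodes = world // per_node

    # new_group must be called by every rank for every group, in the same order.
    intra_groups = [dist.new_group(list(range(n * per_node, (n + 1) * per_node)))
                    for n in range(n_nodes)]
    inter_group = dist.new_group([n * per_node for n in range(n_nodes)])

    t_intra = timed_all_reduce(intra_groups[rank // per_node])
    if rank % per_node == 0:                     # one GPU per node joins the inter-node group
        t_inter = timed_all_reduce(inter_group)
        if rank == 0:
            print(f"intra-node: {t_intra * 1e3:.1f} ms   inter-node: {t_inter * 1e3:.1f} ms")

if __name__ == "__main__":
    main()
```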
PaRO-DP: Design#
The PaRO-DP design centers on optimizing data parallel training of large language models (LLMs) by addressing memory and communication cost trade-offs. It cleverly leverages the disparity between intra- and inter-group communication speeds within a cluster, introducing a novel set of strategies that refine model state partitioning. This partitioning goes beyond traditional approaches, offering more choices and flexibility in the allocation of parameters, gradients, and optimizer states across GPUs. A key element is the introduction of partial redundancy, allowing for faster intra-group communication at the cost of some memory increase. This innovative approach is supported by a quantitative guideline for selecting the most effective strategy based on specific hardware and model characteristics. The design is notable for its systematic consideration of memory-communication trade-offs and its adaptability to varying training scenarios, such as full-parameter and partial-parameter training, ultimately leading to significant LLM training speed improvements.
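To make the memory side of this trade-off concrete, the sketch below estimates per-GPU model-state memory for a P_{p+g+os} strategy using the common mixed-precision Adam accounting (2 bytes per fp16 parameter, 2 bytes per fp16 gradient, 12 bytes per parameter of fp32 optimizer states). Those byte constants and the simple divide-by-shard-count accounting are assumptions based on that standard model, not figures taken from the paper.

```python
def per_gpu_model_state_bytes(psi, strategy, n_gpus, group_size,
                              param_bytes=2, grad_bytes=2, optim_bytes=12):
    """Rough per-GPU memory for model states under a P_{p+g+os} strategy.

    psi        : number of model parameters
    strategy   : e.g. "NIG" -- granularity for (params, grads, optimizer states)
    n_gpus     : total GPUs; group_size: GPUs per fast intra-group
    Byte counts follow the usual fp16/Adam accounting (an assumption here).
    """
    divisor = {"N": 1, "I": group_size, "G": n_gpus}
    p, g, o = strategy
    return (param_bytes * psi / divisor[p]
            + grad_bytes * psi / divisor[g]
            + optim_bytes * psi / divisor[o])

# Example: a 7B-parameter model on 32 GPUs (8 GPUs per group).
for s in ("NNN", "NIG", "IIG", "GGG"):
    gb = per_gpu_model_state_bytes(7e9, s, n_gpus=32, group_size=8) / 2**30
    print(f"{s}: ~{gb:.1f} GB of model states per GPU")
```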
PaRO-CC: Topology#
PaRO-CC's innovative topology reimagines collective communication in distributed training, addressing the performance bottleneck of heterogeneous networks. Instead of a single flat ring topology, PaRO-CC structures communication in two layers: intra-group and inter-group. This hierarchical design leverages high-speed intra-group connections (e.g., NVLink) within each GPU machine or switch, significantly reducing the reliance on slower inter-group communication (e.g., RDMA over Ethernet). By carefully orchestrating the communication flow between these layers (executing intra-group and inter-group operations in parallel wherever possible), PaRO-CC minimizes latency and maximizes throughput. This strategy allows for greater scalability and efficiency, especially in large-scale clusters with varying levels of network connectivity. The intelligent rearrangement of the communication topology is key: it adapts to the specific architecture of the GPU cluster, thereby improving overall training speed. The results demonstrate a substantial improvement in efficiency compared to the traditional ring structure, illustrating the practical value of this topology-focused approach to optimizing collective communication.
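The two-layer idea can be illustrated with standard collective primitives: reduce-scatter inside each fast group, all-reduce the resulting shard across groups, then all-gather inside the group again, so only a 1/m fraction of the data crosses the slow links. The code below is a simplified, sequential re-implementation using torch.distributed process groups (it does not attempt the intra/inter overlap described above and is not the authors' PaRO-CC code); it assumes the tensor's first dimension is divisible by the group size and that `intra_group`/`inter_group` were created consistently on every rank.

```python
import torch
import torch.distributed as dist

def hierarchical_all_reduce(tensor, intra_group, inter_group, group_size):
    """All-reduce `tensor` in two layers: fast intra-group links first,
    then the slower inter-group links on only a 1/group_size shard."""
    shards = [s.contiguous() for s in tensor.chunk(group_size)]

    # 1) Intra-group reduce-scatter over the fast links: this rank keeps the
    #    group-local sum of the shard matching its index inside the group.
    my_shard = torch.zeros_like(shards[0])
    dist.reduce_scatter(my_shard, shards, group=intra_group)

    # 2) Inter-group all-reduce of just that shard: only 1/group_size of the
    #    data crosses the slower inter-group network.
    dist.all_reduce(my_shard, group=inter_group)

    # 3) Intra-group all-gather to rebuild the fully reduced tensor.
    gathered = [torch.empty_like(my_shard) for _ in range(group_size)]
    dist.all_gather(gathered, my_shard, group=intra_group)
    tensor.copy_(torch.cat(gathered))
    return tensor
```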
Scaling Experiments#
A robust 'Scaling Experiments' section would delve into how training performance changes as resources increase. This would involve systematically varying the number of GPUs and observing the impact on metrics such as training throughput, memory usage, and communication overhead. Detailed graphs visualizing these relationships are crucial, showing any deviations from linear scaling and potentially identifying bottlenecks. The analysis should carefully consider the trade-offs between scalability and cost-efficiency, evaluating whether the gains from additional resources justify the increased expense. The experimental setup should be described in detail, including hardware specifications, software versions, and training hyperparameters, to ensure reproducibility. A comparison with other state-of-the-art training methods is vital to establish a competitive advantage at scale. The results should be analyzed in the context of different model sizes and training objectives, highlighting which scaling strategies prove most effective under various scenarios. Finally, a discussion of the limits of scalability, including potential communication and synchronization challenges, is needed for a holistic perspective.
More visual insights#
More on figures
This figure illustrates the PaRO-DP strategy using a cluster with four GPUs divided into two groups. It shows the computation and communication flow for a global step (incorporating gradient accumulation). Panel (a) shows the PIIG strategy and panel (b) shows the PNIG strategy. The figure highlights the intra- and inter-group communication aspects of each strategy using specific prefixes for each collective communication primitive. This helps in visualizing how PaRO-DP optimizes the communication operations by performing intra-group operations before inter-group operations, thus reducing the communication overhead in heterogeneous network environments.
This figure displays the throughput and memory usage during the training of LLaMA models with different numbers of trainable parameters. Four different scenarios are shown, varying both model size (7B and 65B parameters) and the ratio of trainable parameters to total parameters (100%, 1/16). The x-axis represents the different training strategies used, including PaRO strategies and existing methods (ZeRO, MiCS, FSDP-hz, ZeRO++). The y-axis shows throughput and single-GPU memory usage. The blue dashed line indicates the trend of the throughput indicator (TPS, calculated with log(1/T)) predicted by the guideline presented in the paper, while the crosses denote out-of-memory (OOM) errors. The figure illustrates how the choice of strategy affects both training speed and memory consumption under different parameter scaling scenarios.
This figure shows the throughput, TPS indicator (log(1/T)), and peak memory usage for different model parallel training strategies in a PEFT scenario with LLaMA-65B model. The results highlight the superior performance of PaRO-DP strategies over existing approaches, indicating improvements in training speed and memory efficiency.
This figure compares the collective communication time of the traditional Ring topology and the proposed PaRO-CC method, as the number of GPUs increases from 16 to 128. It shows that PaRO-CC significantly reduces the communication time compared to the Ring topology, especially with a larger number of GPUs. This improvement is due to the efficient utilization of intra- and inter-group communication within PaRO-CC.
The figure shows the throughput of different training strategies (ZeRO-2, ZeRO-3, NIG, IGG, and IIG) as the number of GPUs increases from 16 to 128. It illustrates the scalability and performance of each strategy in a data parallel setting. The throughput is measured in samples per second.
This figure compares the training convergence curves of PaRO and ZeRO-3 using the LLaMA-7B language model. It demonstrates that the different PaRO strategies (NIG, IGG, IIG) achieve comparable convergence to the baseline ZeRO-3 method, indicating that the proposed PaRO optimization techniques do not negatively impact model training performance. The y-axis shows the loss, while the x-axis represents the training step.
This figure schematically shows how the Partial Redundancy Optimizer (PaRO) Data Parallelism (PaRO-DP) works in a cluster of four GPUs divided into two groups. It illustrates the computation and communication steps during a global training step using gradient accumulation. The diagrams highlight the intra-group and inter-group communications using specific prefixes (e.g., ‘Intra-AllGather’, ‘Inter-ReduceScatter’) for each collective communication primitive, illustrating the differences between the PIIG and PNIG strategies of PaRO-DP.
This figure illustrates the PaRO-DP strategy using two different model partitioning schemes (PIIG and PNIG) within a cluster of 4 GPUs divided into 2 groups. It shows the data flow and communication operations (intra- and inter-group) during a single global training step, including the forward pass, backward pass, gradient accumulation and gradient application phases. The visualization highlights the differences between PIIG and PNIG in terms of parameter and gradient partitioning, showcasing how PaRO-DP optimizes communication costs by leveraging intra-group communication whenever possible.
More on tables
This table shows how data blocks are distributed among GPUs based on different partitioning strategies: No partitioning, Intra-group partitioning, and Global partitioning. For each strategy, it specifies the indices of the data blocks that reside on a given GPU within a group and across different groups.
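As a hedged illustration of that placement rule (the exact block-index scheme below is an assumption, not copied from the table): under no partitioning a GPU keeps the whole state, under intra-group partitioning it keeps the shard matching its index within its group (so each shard is replicated once per group), and under global partitioning it keeps the shard matching its global rank.

```python
def shard_for_gpu(rank, group_size, n_gpus, granularity):
    """Which fraction of a model state a GPU holds, as a (start, end) interval in [0, 1).

    'N': the whole state (fully replicated on every GPU)
    'I': 1/group_size of it, selected by the GPU's index inside its group
         (the same shard appears once in every group -- partial redundancy)
    'G': 1/n_gpus of it, selected by the GPU's global rank (no redundancy)
    """
    if granularity == "N":
        return (0.0, 1.0)
    if granularity == "I":
        i = rank % group_size
        return (i / group_size, (i + 1) / group_size)
    if granularity == "G":
        return (rank / n_gpus, (rank + 1) / n_gpus)
    raise ValueError(f"unknown granularity: {granularity}")

# Example: GPU 5 in a 2-group, 8-GPU cluster (4 GPUs per group).
print(shard_for_gpu(5, group_size=4, n_gpus=8, granularity="I"))  # (0.25, 0.5)
print(shard_for_gpu(5, group_size=4, n_gpus=8, granularity="G"))  # (0.625, 0.75)
```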
This table details the collective communication operations used for synchronization between blocks of different partitioning in the PaRO-DP framework. It shows the input and output blocks, participation ranks (GPU indices), and a description of each operation (global all-gather, global reduce_scatter, etc.), indicating how they are optimized by PaRO-CC.
This table presents the results of training LLaMA-7B with different numbers of trainable parameters (full parameters, Ψ’=Ψ; partial parameters, Ψ’=Ψ/16). It compares various strategies in terms of throughput (1/T, samples per second) and peak GPU memory usage (Mem(GB)). The table allows for the comparison of different approaches under different memory constraints.
This table presents all possible combinations of partitioning strategies for model parameters (p), gradients (g), and optimizer states (os) across three levels of granularity: no partitioning (N), intra-group partitioning (I), and global partitioning (G). Each combination is represented as P_{p+g+os} (e.g., NNN, IIG), indicating the partitioning level for each component. The table highlights combinations identified as effective and those deemed ineffective, based on the analysis presented in the paper. Ineffective strategies were eliminated based on an analysis of memory usage and communication costs, considering the trade-offs between these two aspects. The effective strategies form the basis of the PaRO-DP approach.
This table presents the training throughput (1/T) and GPU memory usage for the LLaMA-7B model under two different scenarios: full-parameter training (Ψ’ = Ψ) and partial-parameter training (Ψ’ = Ψ/16). It compares various PaRO-DP strategies against existing methods like ZeRO and MiCS, showing the throughput and peak memory used by each strategy. The results highlight the effectiveness of PaRO-DP in improving training throughput while maintaining reasonable memory consumption.
This table presents the results of training throughput (1/T) and GPU memory usage for the LLaMA-7B model under two different scenarios: full-parameter training (Ψ’ = Ψ) and partial-parameter training (Ψ’ = Ψ/16). It compares various PaRO-DP strategies against existing methods such as ZeRO-2, ZeRO-3, MiCS, and FSDP-hz. The table helps demonstrate the performance and memory efficiency improvements achieved with PaRO-DP, particularly in the partial-parameter training case, which is important for efficiency. Note that 1/T is a measure of training speed.
This table presents all possible combinations of model states partitioning strategies (Parameter, Gradient, Optimizer state) across three levels of granularity (No partitioning, Intra-group partitioning, Global partitioning). Each combination is evaluated for effectiveness, with a checkmark indicating effective strategies and a cross indicating ineffective strategies based on the insights from the paper. This helps determine the optimal partitioning based on the specific needs of model size, memory usage and communication requirements.
This table presents the calculation formula for t_param (the time cost of all-gathering parameters) under three different partitioning granularities: no partitioning (N), intra-group partitioning (I), and global partitioning (G). It shows how the time cost changes depending on the number of GPUs in a group (m), the total number of GPUs (n), the bandwidth between GPUs within a group (B'), the bandwidth between groups (B), and the number of parameters (Ψ).
This table shows the calculation formula for the time cost (t_gradient) of gradient synchronization during the backward pass under the different model state partitioning strategies (no partitioning (N), intra-group partitioning (I), and global partitioning (G)). The formula considers the number of trainable parameters (Ψ'), the number of GPUs in a group (m), the total number of GPUs (n), the inter-group bandwidth (B), and the intra-group bandwidth (B').
This table details the calculation formulas for the time cost (t_sync_g) of synchronizing gradients under the various optimizer-state partitioning strategies. The formulas account for the number of trainable parameters (Ψ'), the total number of GPUs (n), the number of GPUs per group (m), the number of groups (g), the inter-group bandwidth (B), and the intra-group bandwidth (B').
This table presents the calculation formulas for the time cost (t_sync_p) of synchronizing model parameters after they have been updated, considering the different partitioning strategies (no partitioning (N), intra-group partitioning (I), and global partitioning (G)) for parameters (p) and optimizer states (os). The formulas account for the number of trainable parameters (Ψ'), the number of GPUs in a group (m), the total number of GPUs (n), the number of groups (g = n/m), the intra-group bandwidth (B'), and the global bandwidth (B). The time cost is 0 when neither the parameters nor the optimizer states are partitioned.
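As a reference point for these tables, the textbook ring-based cost model gives per-rank communication times of roughly the following form, where Ψ' is the size in bytes of the state being synchronized, m the number of GPUs per group, n the total number of GPUs, B' the intra-group bandwidth, and B the inter-group bandwidth. These approximations are stated here as an assumption for orientation only; the paper's tables combine and refine such terms per partitioning strategy.

```latex
% Standard ring-based approximations (an assumption, not the paper's exact tables).
\begin{align*}
t^{\text{intra}}_{\text{all-gather}}     &\approx \frac{m-1}{m}\cdot\frac{\Psi'}{B'}, &
t^{\text{global}}_{\text{all-gather}}    &\approx \frac{n-1}{n}\cdot\frac{\Psi'}{B}, \\
t^{\text{intra}}_{\text{reduce-scatter}} &\approx \frac{m-1}{m}\cdot\frac{\Psi'}{B'}, &
t^{\text{global}}_{\text{all-reduce}}    &\approx 2\cdot\frac{n-1}{n}\cdot\frac{\Psi'}{B}.
\end{align*}
```

Because a global ring crosses the slow links, it is bottlenecked by the inter-group bandwidth B, which is why the PaRO strategies favor intra-group collectives and move as little data as possible across groups.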
This table presents the results of training throughput (1/T) and GPU memory usage for the LLaMA-7B model under two different scenarios: full-parameter training (Ψ’ = Ψ) and partial-parameter training (Ψ’ = Ψ/16). For each scenario, it shows the performance of various PaRO-DP strategies along with some existing strategies like ZeRO-2 and ZeRO-3 for comparison. The results are intended to demonstrate the effectiveness of PaRO-DP strategies in optimizing the trade-off between training speed and memory consumption.
This table presents the results of training throughput (1/T) and GPU memory usage for the LLaMA-7B model under two different scenarios: full-parameter training (Ψ’=Ψ) and partial-parameter training (Ψ’=Ψ/16). It compares the performance of various PaRO-DP strategies against existing methods such as ZeRO-2, ZeRO-3, MiCS, and FSDP-hz. The 1/T values represent the training speed, while Mem(GB) shows the GPU memory consumption for each strategy.
This table presents the measured inter- and intra-group GPU-to-GPU communication throughput in the experimental environment. It shows the transfer size (in bytes) and duration (in milliseconds) for both intra-node and inter-node communication and calculates the throughput in Gigabits per second (Gbps). This highlights the significant performance difference between communication within a single node and communication between nodes.
This table presents a comparison of different training strategies (including PaRO-DP strategies and other state-of-the-art methods) in terms of throughput and peak memory usage when training large language models with full trainable parameters. It showcases the superior performance of several PaRO-DP strategies compared to existing approaches under the same experimental setup.
This table presents the results of experiments conducted on 7B and 65B LLaMA models under partial-parameters training conditions (Ψ’=Ψ/16). It compares various strategies including PaRO-DP, ZeRO++, and FSDP-hz, evaluating their throughput and peak memory usage. The configurations used for each strategy are detailed, allowing for reproducibility. The table shows that PaRO-DP strategies generally achieve higher throughput while maintaining comparable or lower peak memory compared to other methods.
This table presents the results of experiments conducted using Parameter-Efficient Fine-Tuning (PEFT) with the ratio of trainable parameters to model parameters (Ψ'/Ψ) set to 3/1000. It compares the throughput and peak memory usage of different strategies (GGG (ZeRO-3), ING (PaRO), and ZeRO++) for training the LLaMA-65B model, showcasing the performance improvement achieved by PaRO in the PEFT setting.
This table compares the throughput of different strategies for training the LLaMA-7B model on 32 GPUs, when using a full-parameter training setup (Ψ′ = Ψ). It shows the throughput (in samples/sec) for three different configurations of micro-batch size (MBS), accumulation steps (AS), and effective batch size (EBS). The table highlights the impact of dynamic effective batch size on training efficiency.
This table compares the throughput of different training strategies (IIG (PaRO) and GGG (ZeRO-3)) for the LLaMA-65B model on 64 GPUs, while varying the effective batch size. The results show that the IIG strategy of PaRO significantly outperforms ZeRO-3 under most conditions.