Mark III Systems Blog

Benchmarking LLM, Multi-GPU Finetuning Training Strategies with PyTorch Lightning on NVIDIA DGX

If you have ever attempted to finetune a >1B parameter LLM on one GPU, you have probably seen training take several hours, even when using time- and memory-saving strategies like LoRA. You may have wondered how much time could be saved by using more GPUs, or even several nodes of GPU servers. In this blog I use the same training setup across multiple hardware and training strategy configurations and benchmark the time to train 1 epoch. The hardware used is 2 NVIDIA DGX servers, each containing 8 GPUs.

I found that the easiest framework to set up for multi-node training is PyTorch Lightning, and I use DeepSpeed ZeRO Stage 2 and Stage 3 as my training strategies. For more information about ZeRO, see the distributed training strategies overview blog.
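As a rough illustration of how little changes between the configurations benchmarked below, here is a minimal sketch of the Trainer arguments involved. The helper names `trainer_kwargs` and `build_trainer` are hypothetical, not part of Lightning, and this is not the exact script used for these benchmarks. It assumes a Lightning version that accepts the `"deepspeed_stage_2"` and `"deepspeed_stage_3"` strategy strings and has DeepSpeed installed.

```python
def trainer_kwargs(devices: int, num_nodes: int, zero_stage: int) -> dict:
    """Build keyword arguments for a Lightning Trainer using DeepSpeed ZeRO.

    Hypothetical helper for illustration only.
    """
    if zero_stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO stage 2 and 3")
    return {
        "accelerator": "gpu",
        "devices": devices,      # GPUs used on each node
        "num_nodes": num_nodes,  # number of DGX servers in the job
        "strategy": f"deepspeed_stage_{zero_stage}",
        "max_epochs": 1,         # the benchmarks time a single epoch
    }


def build_trainer(devices: int, num_nodes: int, zero_stage: int):
    # Imported here so trainer_kwargs can be inspected on a machine
    # without Lightning or GPUs available.
    import pytorch_lightning as pl

    return pl.Trainer(**trainer_kwargs(devices, num_nodes, zero_stage))


# Arguments for the 16 GPU / 2 DGX / ZeRO 2 run.
print(trainer_kwargs(devices=8, num_nodes=2, zero_stage=2))
```

Switching between runs then comes down to changing `devices`, `num_nodes`, and the ZeRO stage, while the model and data code stay the same.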

1 GPU / 1 DGX


When using only 1 V100 GPU on a single DGX, this training job takes a little over an hour to complete one epoch. This will serve as the baseline to compare our distributed training jobs against. Let’s see if switching to 8 GPUs speeds things up.

8 GPUs / 1 DGX / DeepSpeed ZeRO 2


Unsurprisingly, using 8 V100 GPUs rather than 1 resulted in a much faster training job. This job took ~9 minutes, which is a >6x speedup over our previous run. Now let’s increase the number of GPUs even more by adding another node to the job.
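To make the speedup arithmetic explicit, here is a small sketch that computes speedup and scaling efficiency (speedup divided by GPU count). The function names are my own, and the 60-minute baseline is a rounded-down stand-in for "a little over an hour", so the real speedup is slightly higher than what this prints.

```python
def speedup(baseline_minutes: float, parallel_minutes: float) -> float:
    """How many times faster the parallel run is than the baseline."""
    return baseline_minutes / parallel_minutes


def scaling_efficiency(
    baseline_minutes: float, parallel_minutes: float, num_gpus: int
) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup(baseline_minutes, parallel_minutes) / num_gpus


# Rounded timings from the text: ~60 min on 1 GPU, ~9 min on 8 GPUs.
s = speedup(60, 9)
e = scaling_efficiency(60, 9, 8)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")  # speedup: 6.67x, efficiency: 83%
```

An efficiency below 100% is expected, since some time always goes to setup and to communication between GPUs.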

16 GPUs / 2 DGXs / DeepSpeed ZeRO 2


When using both nodes we see another speedup in our training job, which took a little less than 8 minutes. However, the speedup was not as dramatic as when we went from 1 to 8 GPUs, even though we are now using a total of 16 GPUs. There are a few reasons for this.

First, when using more GPUs and nodes you incur additional time to set up all the devices needed for training. Second, there is more communication that needs to happen between all those devices. The most problematic example of this is communication between our two DGX nodes. The cluster I’m benchmarking with has a 1000 Mb/s (1 Gb/s) network connection between nodes. This is a major bottleneck for our job, as we will see more clearly in the next benchmark using DeepSpeed ZeRO 3.

16 GPUs / 2 DGXs / DeepSpeed ZeRO 3


When we switch to DeepSpeed ZeRO 3 we see a massive slowdown vs ZeRO 2 on the same 2-node setup. Running this job took about 3 times as long as running the same job on just one GPU. This may seem like a surprising result, but it can be explained by the network speed we talked about previously. ZeRO 3 shards the model parameters themselves across GPUs, in addition to the optimizer states and gradients that ZeRO 2 already shards, so even more data has to be shared between nodes during training. That turns our network speed into a much larger bottleneck.
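A back-of-envelope estimate shows why the slow link hurts so much. The sketch below computes a lower bound on the time needed to push one full copy of a model's weights over a network link. The 1 billion parameter count and 2 bytes per parameter are hypothetical values chosen for illustration, not the model used in these benchmarks, and the estimate ignores latency, protocol overhead, and any overlap of communication with compute.

```python
def transfer_seconds(
    num_params: int, bytes_per_param: int, link_bits_per_second: float
) -> float:
    """Lower-bound time to send one full copy of the weights over a link."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / link_bits_per_second


GIGABIT = 1_000_000_000  # 1000 Mb/s, the inter-node speed quoted in the text

# Hypothetical model: 1 billion parameters stored in 16-bit precision.
seconds = transfer_seconds(1_000_000_000, 2, GIGABIT)
print(f"{seconds:.0f} s per full copy of the weights")  # 16 s per full copy of the weights
```

Because ZeRO 3 has to gather sharded parameters repeatedly as training proceeds, a cost on this order can recur many times per epoch. How much of it actually crosses the inter-node link depends on the sharding layout and the collective algorithm, so treat this as an order-of-magnitude illustration only.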

8 GPUs / 1 DGX / DeepSpeed ZeRO 3


Just to double-check that the massive slowdown was due to the inter-node network speed, I ran the job again using ZeRO 3 on just one node. The training time was now only a few minutes slower than the 1-node ZeRO 2 job and much faster than the 2-node job we just looked at. This confirms how important network speed is when attempting multi-node distributed training.