Mark III Systems Blog

NVIDIA – RAPIDS Benchmarking

If you have ever used Pandas, you know that it is a great tool for creating and working with data frames. As with any software, though, there is room to make it perform better. NVIDIA has developed RAPIDS, a data science framework made up of libraries for executing end-to-end pipelines entirely on the GPU (www.rapids.ai). Included in RAPIDS is cuDF, which allows data frames to be loaded and manipulated on a GPU.

In this post, I am going to discuss some benchmarking I have done with RAPIDS, particularly cuDF. I conducted multiple experiments in which I created data frames that ran on the CPU using Pandas and data frames that ran on the GPU using cuDF, then executed common methods on those data frames. I will be using the term "processes" to describe the execution of these methods. I will also follow the convention NVIDIA uses, in which a Pandas DataFrame is called a PDF and a cuDF DataFrame is called a GDF (GPU DataFrame).
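
For readers who have not used cuDF before, the convention looks something like the sketch below. It is purely illustrative and not taken from my notebook; the column names are made-up stand-ins.

```python
import pandas as pd
import cudf  # RAPIDS GPU DataFrame library

# Illustrative data; any dictionary of columns works the same way in both libraries
data = {"ORG_NAME": ["STAPH AUREUS", "E. COLI"], "DILUTION_VALUE": [2.0, 4.0]}

pdf = pd.DataFrame(data)    # PDF: held in system memory, processed on the CPU
gdf = cudf.DataFrame(data)  # GDF: held in GPU memory, processed on the GPU
```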

For my benchmarking data, I used a CSV file from MIMIC-III, a freely accessible critical care database created by the MIT Lab for Computational Physiology (https://mimic.physionet.org/). The file was named MICROBIOLOGYEVENTS.csv and consisted of 631,726 rows and 16 columns of data. I duplicated the records in the file to create new files of 5 million, 10 million, 20 million, and 40 million rows, each with the same 16 columns. Experiments were then conducted using each of these 5 files. An individual experiment is defined as running one process with one of the 5 versions of the MICROBIOLOGYEVENTS file as input. Each experiment was repeated 5 times and the results averaged to give the final numbers I report.
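
I have not included my file-preparation script here, but duplicating the records is straightforward with Pandas alone. A rough sketch, assuming enough system memory and using illustrative output file names:

```python
import pandas as pd

base = pd.read_csv("MICROBIOLOGYEVENTS.csv")  # 631,726 rows x 16 columns

for target_rows in (5_000_000, 10_000_000, 20_000_000, 40_000_000):
    copies = -(-target_rows // len(base))  # ceiling division: duplications needed
    bigger = pd.concat([base] * copies, ignore_index=True).head(target_rows)
    bigger.to_csv(f"MICROBIOLOGYEVENTS_{target_rows}.csv", index=False)
```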

The benchmarking was done on both an NVIDIA DGX-1 and an IBM Power System AC922, using a single GPU in each. Both servers had NVIDIA V100 GPUs; the DGX-1's GPU had 32GB of memory, while the AC922's had 16GB.

For the benchmarking, I ran some common processes on both PDF and GDF data frames and measured how long each took to run. The processes were executed in the following order, using a Jupyter Notebook that can be found on my Github here; a condensed sketch of the operations appears after the list.

·       Loading the file from a CSV into PDF and GDF data frames.

·       Finding the number of unique values in one column.

·       Finding the number of rows with unique values in one column.

·       Finding the 5 smallest and largest values in one column.

·       Selecting 5 specific rows in a column by index.

·       Sorting the data frame by values in one column.

·       Creating a new column with no data in it.

·       Creating a new column populated with a calculated value (the value of a preexisting column multiplied by 2).

·       Dropping a column.
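
The sketch below condenses those processes into one timing loop. It is illustrative rather than a copy of my notebook: the column names ORG_NAME and DILUTION_VALUE stand in for whichever columns were actually used, and the timing is a simple wall-clock measurement.

```python
import time

import numpy as np
import pandas as pd
import cudf


def run_processes(read_csv, path, key_col="ORG_NAME", num_col="DILUTION_VALUE"):
    """Load a data frame with the given reader (pd.read_csv for a PDF,
    cudf.read_csv for a GDF) and time each of the benchmarked processes."""
    times = {}

    def timed(name, fn):
        start = time.time()
        result = fn()
        times[name] = time.time() - start
        return result

    df = timed("load csv", lambda: read_csv(path))
    timed("unique values in one column", lambda: df[key_col].nunique())
    timed("rows per unique value", lambda: df[key_col].value_counts())
    timed("5 smallest and largest values",
          lambda: (df[num_col].nsmallest(5), df[num_col].nlargest(5)))
    timed("select 5 rows by index", lambda: df[key_col].loc[[0, 1, 2, 3, 4]])
    timed("sort by one column", lambda: df.sort_values(by=num_col))
    timed("create empty column", lambda: df.assign(EMPTY=np.nan))
    timed("create calculated column", lambda: df.assign(DOUBLED=df[num_col] * 2))
    timed("drop a column", lambda: df.drop(columns=[key_col]))
    return times


pdf_times = run_processes(pd.read_csv, "MICROBIOLOGYEVENTS.csv")
gdf_times = run_processes(cudf.read_csv, "MICROBIOLOGYEVENTS.csv")
```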

I also created a second Jupyter Notebook that was used to concatenate two data frames. In this experiment, MICROBIOLOGYEVENTS.csv, which has 631,726 rows, was concatenated onto each of the 5 MICROBIOLOGYEVENTS input files.
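
The concatenation itself is a one-line call in both libraries. A minimal sketch, with an illustrative file name for the larger input:

```python
import pandas as pd
import cudf

small_pdf = pd.read_csv("MICROBIOLOGYEVENTS.csv")           # 631,726 rows
big_pdf = pd.read_csv("MICROBIOLOGYEVENTS_40000000.csv")    # one of the inflated files
pdf_both = pd.concat([big_pdf, small_pdf], ignore_index=True)   # CPU concatenation

small_gdf = cudf.read_csv("MICROBIOLOGYEVENTS.csv")
big_gdf = cudf.read_csv("MICROBIOLOGYEVENTS_40000000.csv")
gdf_both = cudf.concat([big_gdf, small_gdf], ignore_index=True)  # GPU concatenation
```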

Results

In 4 of the 9 experiments, the GDF outperformed the PDF regardless of the input file that was used.  In 3 experiments, the PDF outperformed the GDF.  Interestingly, in 2 experiments the PDF outperformed the GDF on small dataframes but not on the larger ones.  In the concatenation experiments, the GDF always outperformed the PDF.  The results for the processes that were run on the AC922 are below.  The results for the DGX-1 are similar.  For complete results, including the actual times for the processes to run and the DGX-1 results, see my Github.

The most remarkable differences in performance were in the following processes.

GDF Outperforms PDF

·       For time to load the input file, the GDF outperformed the PDF by an average of 8.3x (range 4.3x-9.5x). For the input file with 40 million records, the GDF was created and loaded in 5.87 seconds, while the PDF took 56.03 seconds.

·       When sorting the data frame by values in one column, the GDF outperformed the PDF by an average of 15.5x (range 2.1x-23.4x). Because the GPU in the AC922 has only 16GB of RAM, the 40 million row data frame could not be sorted on it, so these numbers include the DGX-1 sort results for that data frame.

·       When creating a new column populated with a calculated value, the GDF outperformed the PDF by an average of 4.8x (range 2.0x-7.1x).

·       The most remarkable performance difference was seen when dropping a single column. Amazingly, the GDF outperformed the PDF by an average of 3,979.5x (range 255.7x-9,736.9x), and the speedup scaled linearly as the data frame grew larger.

·       When concatenating the 631,726 row data frame onto another data frame, the GDF outperformed the PDF by an average of 10.4x (range 1.2x-29.0x). As with sorting, the 16GB GPU ran out of memory when concatenating onto the 40 million row data frame, so these numbers include the DGX-1 results for that data frame.

PDF Outperforms GDF

When the PDF outperformed the GDF, the results were astonishing.

·       When finding the number of unique values in one column, the PDF outperformed the GDF by an average of 74.1x (range 5.7x-286.0x). However, as the size of the data frame increased, the performance difference shrank dramatically, from 285.9x down to 5.7x. This suggests that at some point the GDF would most likely perform better, but additional experiments would be needed to demonstrate that. The same trend appears in the processes where the PDF outperforms the GDF only on smaller data frames.

·       The most remarkable performance difference was seen when selecting 5 specific rows in a column by index. In this case, the PDF outperformed the GDF by an average of 427.4x (range 32.2x-735.0x).

PDF Outperforms GDF on Smaller Data Frames

·       When selecting the 5 smallest and largest values in a column, the PDF outperformed the GDF on the 0.631 million, 5 million, and 10 million row data frames, with its advantage decreasing as the data frame grew larger (range 1.1x-12.7x). On the larger data frames, the GDF performed better, and its advantage increased as the data frame grew larger (range 1.8x-3.3x).

·       When adding a blank column, the PDF outperformed the GDF only on the 0.631 million row data frame.

Summary

As shown above, data frames that run on the GPU can often speed up processes that manipulate the data frame by 10x to over 1,000x compared to data frames that run on the CPU, but this is not always the case. There is also a tradeoff in which smaller data frames tend to perform better on the CPU while larger data frames perform better on the GPU. The syntax for using a GDF is slightly different from that of a PDF, but the learning curve is not steep, and the effort is worth the reward. I'm going to try the same benchmarking on some other data sets and use some other methods to see how the results compare. Stay tuned for the next installment.

Follow me on Twitter and on LinkedIn!