Introduction

Our system is publicly available at https://github.com/plasma-umass/nextdoor/ . Our experimental setup is available at https://github.com/abhijangda/nextdoor-experiments/ .

Availability

NextDoor: NextDoor is publicly available at https://github.com/plasma-umass/nextdoor/ . This repo contains the source code of NextDoor and the implementations of all 10 sampling algorithms described in the paper (DeepWalk, PPR, Node2Vec, MultiRW, KHop, Layer, ClusterGCN, MVS, FastGCN, and LADIES).

The README.md file explains the source code layout, points to the files containing the implementations of these algorithms in NextDoor, and describes how to execute these implementations and how to generate Python 2/3 modules for them.

 

Writing a New Sampling Algorithm in NextDoor: NextDoor’s public repository https://github.com/plasma-umass/nextdoor/ includes an example application that users can follow to create and execute a new sampling algorithm; README.md describes the process.

 

Datasets: All graphs used in our evaluation (PPI, Reddit, Patents, Orkut, LiveJournal1) are available in a ZIP file at https://drive.google.com/file/d/19duY39ygWS3RAiqEetHUKdv-4mi_F7vj/view?usp=sharing .

 

Benchmarks: We use 10 sampling algorithms as benchmarks. Implementations of these algorithms in NextDoor are available in the src/apps directory. We also provide NextDoor’s implementations of sampling algorithms integrated into existing GNNs at https://github.com/abhijangda/nextdoor-experiments/ . Its README.md describes the steps to execute this integration for all GNNs. Moreover, our benchmarking infrastructure, i.e., the scripts to execute these experiments, is available in the same repo.

 

Baselines: We have two baselines. The first is KnightKing, a system for describing and executing random walks; it is available at https://github.com/KnightKingWalk/KnightKing . Second, we use the samplers in existing GNNs as baselines. We have collected all of these GNN implementations in a single repository, https://github.com/abhijangda/nextdoor-experiments/, and we provide scripts to evaluate these baselines.

Functionality

Prerequisites

Linux Installation: We recommend Ubuntu 18.04 as the Linux OS. We have not tested our artifact on any other OS, but we believe Ubuntu 20.04 should also work.

 

Install Dependencies: Execute the following commands to install dependencies:

 

sudo apt update && sudo apt install gcc linux-headers-$(uname -r) make g++ git python-dev python3-dev wget unzip python3-pip cmake openmpi* libopenmpi*  libmetis-dev

 

sudo pip3 install virtualenv

 

 

Install CUDA: We need to install CUDA before proceeding further. In our experiments we used CUDA 11.0 on Ubuntu 18.04. NextDoor uses other CUDA libraries such as cuRAND and CUB, and CUB in particular does not ship with the CUDA toolkit before version 11. Hence, we recommend CUDA 11.0, which makes the build process easier. The CUDA 11.0 toolkit can be downloaded from https://developer.nvidia.com/cuda-11.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

While installing CUDA, please make sure that the CUDA samples are installed in your home directory and that CUDA itself is installed in /usr/local/cuda. Alternatively, the CUDA samples can be copied from /usr/local/cuda/samples/ afterwards.
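
For example, assuming the default install location and the samples directory name used below, the samples can be copied with:

cp -r /usr/local/cuda/samples ~/NVIDIA_CUDA-11.0_Samples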

 

Check CUDA Installation: To check the CUDA installation, go to the CUDA samples in your home directory and execute the following commands:

 

cd ~/NVIDIA_CUDA-11.0_Samples/1_Utilities/deviceQuery

make

./deviceQuery

 

Executing deviceQuery prints information about the GPUs in your system; a successful run ends with a line like “Result = PASS”. If there is an error, then either no CUDA device is present or the CUDA driver is not installed correctly.

 

Set NVCC and CUDA Library Paths: We assume that nvcc is present at /usr/local/cuda/bin/nvcc. Please make sure this is a valid path and that this nvcc is from CUDA 11.0 (check with nvcc --version). Then add it to your PATH and update LD_LIBRARY_PATH to include the CUDA libraries:

 

export PATH="/usr/local/cuda/bin:$PATH"

export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
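
After exporting both variables, verify that the expected toolkit is picked up (it should report release 11.0):

nvcc --version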

 

Create Parent Directory for Evaluation:

For the functionality and reproducibility steps, create a top-level directory that will contain NextDoor and its experiments. Let us say this directory is NextDoorEval.

 

mkdir NextDoorEval

export TOP_DIR=`pwd`/NextDoorEval 

export NEXTDOOR_DIR=$TOP_DIR/NextDoor

 

Clone NextDoor and Download Dataset: We must clone the repository and download the graph dataset from https://drive.google.com/file/d/19duY39ygWS3RAiqEetHUKdv-4mi_F7vj/view?usp=sharing . This zip file contains five graphs (PPI, Reddit, Orkut, Patents, LiveJournal1). Extract the zip file into the input directory of the NextDoor repo clone. The following commands perform these operations:

 

cd $TOP_DIR

git clone --recurse-submodules https://github.com/plasma-umass/NextDoor.git

cd NextDoor

git submodule update --init --recursive

./downloadDataset.sh

unzip input.zip
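
The input directory should now contain the five graph files (e.g., ppi.data, which is used in the commands below); this can be checked with:

ls input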

 

Executing the Example Application

After setting up, we are now ready to execute sampling algorithms using NextDoor. NextDoor’s repository describes several sampling algorithms.

The Uniform Random Walk application is described in the example directory of the repository; the file uniformRandWalk.cu contains the implementation, which can be compiled with make. The produced binary (uniformRandWalk) performs a random walk on a graph. The following commands build this example and execute it on the PPI graph using the Transit Parallel technique:

 

cd $NEXTDOOR_DIR/example

make

./uniformRandWalk -g ../input/ppi.data -t edge-list -f binary -n 1 -k TransitParallel -l

 

The command line arguments are:

  1. -g: the path to the graph,
  2. -t: the graph storage type, either a list of edges (edge-list) or an adjacency list (adj-list),
  3. -f: the storage format, either binary or text,
  4. -n: the number of times to run the sampling,
  5. -k: the kind of kernel to use, TransitParallel or SampleParallel,
  6. -l: whether load balancing, caching, and other optimizations are applied; this argument is valid for TransitParallel only.
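
For example, the same walk can be run with the Sample Parallel kernel; -l is omitted because it applies only to Transit Parallel:

./uniformRandWalk -g ../input/ppi.data -t edge-list -f binary -n 1 -k SampleParallel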

 

After executing, we expect output like the one below. It shows the number of vertices (56944) and edges (1612348) of the graph, the GPU(s) used by NextDoor (GPU 0), the size of each sample/walk (100), the number of samples to generate (56944), the end-to-end time (0.020685 secs), the breakdown of execution time, and the total number of vertices sampled across all walks (5694400).

 

Graph Binary Loaded

Graph has 1612348 edges and 56944 vertices

Using GPUs: [0,]

Final Size of each sample: 100

Maximum Neighbors Sampled at each step: 1

Number of Samples: 56944

Maximum Threads Per Kernel: 57088

Transit Parallel: End to end time 0.020685 secs

InversionTime: 0.0102875, LoadBalancingTime: 0.00653291, GridKernelTime: 7.84397e-05, ThreadBlockKernelTime: 4.05312e-06, SubWarpKernelTime: 2.14577e-06, IdentityKernelTime: 0.00361753

totalSampledVertices 5694400

 

Using Python Modules: Additionally, the example (uniformRandWalk.cu) builds Python 2 and Python 3 modules for invoking NextDoor’s implementation; running make builds both. The Python 2 module is called UniformRandomWalkPy2 and the Python 3 module is called UniformRandomWalkPy3. Below is an example of invoking UniformRandomWalkPy2’s functions from Python 2.

 

$ python2

Python 2.7.17 (default, Sep 30 2020, 13:38:04)

[GCC 7.5.0] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> import UniformRandomWalkPy2

>>> UniformRandomWalkPy2.initSampling("../input/ppi.data")

Graph Binary Loaded

Using GPUs: [0,]

Final Size of each sample: 100

Maximum Neighbors Sampled at each step: 1

Number of Samples: 56944

Maximum Threads Per Kernel: 57088

>>> UniformRandomWalkPy2.sample()

SampleParallel: End to end time 0.00748205 secs

 

Both modules define the following functions.

  1. initSampling(graph): takes the path to a graph stored in binary edge-list format.
  2. sample(): performs the sampling.
  3. finalSamples(): returns the final samples in the form of a Python list.
  4. finalSampleLength(): returns the size of each sample.
  5. finalSamplesArray(): avoids building a list and instead returns a pointer to NextDoor’s final samples array; this pointer can then be converted to a NumPy array.

 

The above information is also available as docstrings, which can be accessed via <module-name>.__doc__ or <module-name>.<function>.__doc__.
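
As a minimal sketch of how these functions compose, the following can be run from the example directory after make; we assume here that finalSamples() returns all samples as one flat list with finalSampleLength() entries per sample:

python3 << 'EOF'
import UniformRandomWalkPy3 as nd          # Python 3 module built by make

nd.initSampling("../input/ppi.data")       # load the graph (binary edge-list)
nd.sample()                                # run the walk on the GPU
flat = nd.finalSamples()                   # final samples as a Python list
step = nd.finalSampleLength()              # size of each sample (100 for this walk)
# Assumption: samples are laid out consecutively in the flat list.
walks = [flat[i:i + step] for i in range(0, len(flat), step)]
print(len(walks), "walks of length", step)
EOF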

Executing All Other Sampling Algorithms

In the repository, the src/apps/ directory contains implementations of all 10 sampling algorithms. The randomwalks directory contains the implementations of DeepWalk, PPR, and Node2Vec; each of the other sampling algorithms is in its own directory. Executing a sampling algorithm follows the same pattern of commands as above. For example, to execute FastGCN:

 

cd $NEXTDOOR_DIR/src/apps/fastgcn/

make -j

./FastGCNSampling -g ../../../input/ppi.data -t edge-list -f binary -n 1 -k TransitParallel -l

 

Using Python Modules: Alternatively, users can compile these applications into a Python 2 or Python 3 module and invoke them from Python. We provide predefined modules for LADIES, FastGCN, ClusterGCN, and KHop. These modules are meant for integrating NextDoor’s implementations of sampling algorithms into PyTorch or TensorFlow implementations of GNNs.

For example, to use FastGCN from Python 3, execute the following commands:

 

cd $TOP_DIR/NextDoor/src/apps/fastgcn/

make

 

This will generate a “FastGCNSamplingPy3.so” file, which can be imported in Python 3; sampling can then be done as follows:

 

$ python3

Python 3.6.9 (default, Oct  8 2020, 12:12:24)

[GCC 8.4.0] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> import FastGCNSamplingPy3

>>> FastGCNSamplingPy3.initSampling("../../../input/ppi.data")

Graph Binary Loaded

Using GPUs: [0,]

Final Size of each sample: 320

Maximum Neighbors Sampled at each step: 64

Number of Samples: 10000

Maximum Threads Per Kernel: 655360

>>> FastGCNSamplingPy3.sample()

SampleParallel: End to end time 0.0733922 secs

Collective Neighborhood Computing 0.0266759 secs

>>> FastGCNSamplingPy3.finalSamples()

<A python list containing values of all samples>

Reproducing Results

 

We describe how to reproduce the key results of the paper: Figure 6, Figure 7, Figure 8, Figure 10, Table 4, and Table 5.

Setting Up Baselines

We will now set up baselines: KnightKing and existing GNNs.

 

KnightKing: To set up KnightKing, clone its repository https://github.com/KnightKingWalk/KnightKing and compile it using cmake. The commands below perform these operations:

 

cd $TOP_DIR

git clone https://github.com/KnightKingWalk/KnightKing.git --recurse-submodules

cd KnightKing

mkdir build && cd build

cmake .. && make -j

ls bin

 

The compiled binaries are placed in the build/bin directory, which should contain three binaries: deepwalk, node2vec, and ppr.

 

Existing GNNs: We will now set up the existing GNNs within the nextdoor-experiments repo. Execute the commands below to clone the repo and set up the GNNs:

 

cd $TOP_DIR

git clone https://github.com/abhijangda/nextdoor-experiments.git --recurse-submodules

cd nextdoor-experiments

./setup.sh

export NEXTDOOR_EXP_DIR=$TOP_DIR/nextdoor-experiments

Testing NextDoor

 

Setting up Google Test: We need to set up Google Test. Within the NextDoor directory, execute the following commands to build googletest:

 

cd $NEXTDOOR_DIR/googletest

mkdir build         

cd build

cmake .. && make -j

 

Building Tests: To build all single-GPU and multi-GPU tests, execute the following commands in the NextDoor directory:

 

cd $NEXTDOOR_DIR

make -j

 

Testing: We recommend testing DeepWalk before moving forward. To test an application, say DeepWalk, execute the following in the NextDoor directory:

 

cd $NEXTDOOR_DIR

./build/tests/singleGPU/deepwalk

 

This will show output like below:

[==========] Running 15 tests from 1 test suite.

[----------] Global test environment set-up.

[----------] 15 tests from DeepWalk

[ RUN      ] DeepWalk.LiveJournalTP

Graph Binary Loaded

Graph has 68555726 edges and 4847569 vertices

Using GPUs: [0,]

Final Size of each sample: 100

Maximum Neighbors Sampled at each step: 1

Number of Samples: 4847569

Maximum Threads Per Kernel: 4847616

Transit Parallel: End to end time 0.268283 secs

InversionTime: 0.0716658, LoadBalancingTime: 0, GridKernelTime: 0, ThreadBlockKernelTime: 0, SubWarpKernelTime: 0, IdentityKernelTime: 0

totalSampledVertices 172135742

checking results

Adj Matrix of Graph Created

[   OK ] DeepWalk.LiveJournalTP (17441 ms)

 

If any errors occur, please let us know via EuroSys HotCRP.

Performance Evaluation of Sampling Algorithms [Total Time: 4 Hours]

System Requirements: Performance evaluation has the following requirements.

  1. CPU and RAM: NextDoor and KnightKing do not require more than 10 GB of RAM. However, running the existing GNNs on large graphs (Patents, LiveJournal, and Orkut) requires up to 40 GB of RAM. For the performance evaluation against CPU baselines (KnightKing and all GNNs), we used two 16-core Intel Xeon Silver CPUs; hence, we recommend that the reviewer perform the evaluation on a similar system.
  2. GPU: In our paper we performed the evaluation on an NVIDIA Tesla V100 for single-GPU results and four NVIDIA Tesla V100s for multi-GPU results. We also evaluated on an NVIDIA GeForce GTX 1080Ti and found that performance dropped by a factor of 2 compared to the Tesla V100, yet we still obtained orders-of-magnitude improvement over the CPU baselines. Hence, the speedups of NextDoor over the CPU baselines (in Figures 7a and 7b) will drop if a GPU with fewer execution resources than a Tesla V100 is used, but we still expect orders-of-magnitude improvement, and most of the other results in Figure 6 should show similar improvement over SP and TP. We require the GPU to have at least 8 GB of available RAM.
  3. Disk Space: Our experiments need at least 30 GB of disk space.

 

We will now do performance evaluation of sampling implementations in NextDoor against the baselines to reproduce Figures 6 and 7 and Table 5.

In some cases you may see more improvement than presented in the paper, because we have been working on a new, more optimized implementation of NextDoor.

 

Before doing the performance evaluation, (i) follow all steps of the “Prerequisites” section to clone NextDoor and obtain the datasets, (ii) follow all steps of the “Setting Up Baselines” section to obtain the baselines, and (iii) follow all steps of the “Testing NextDoor” section to build the tests. We will use the scripts available in nextdoor-experiments for benchmarking; hence, all scripts must be executed in the order below.

To keep the packaging simple, we updated the implementations of these GNNs to use the latest versions of NumPy, PyTorch, and TensorFlow, so that we need only a single CUDA version, i.e., CUDA 11.0. Unfortunately, these new versions add a restriction that each tensor must be smaller than 2 GB, which cannot be satisfied for large graphs like Orkut and LiveJournal. Hence, the following GNNs run out of memory:

  1. GraphSAGE, MVS, and ClusterGCN give out-of-memory errors on Orkut and LiveJournal.
  2. GraphSAINT, FastGCN, and LADIES give out-of-memory errors on Orkut.

 

We will update the results in the paper to reflect this change.

 

Reproduce Table 5: We will reproduce the end-to-end speedup experiments for GraphSAGE, FastGCN, and LADIES. Unfortunately, we could not get the ClusterGCN integration to run before the submission deadline. For GraphSAGE, we found that newer versions of NumPy and TensorFlow do not allow allocating tensors larger than 2 GB; hence, for the Orkut and LiveJournal graphs, we report OOM for GraphSAGE. We will update this in the final version of our paper.

This experiment will take up to 1 hour to execute.

Within the nextdoor-experiments directory, execute the command below, providing the absolute path to NextDoor:

cd $NEXTDOOR_EXP_DIR

python3 runEndToEnd.py -nextdoor $NEXTDOOR_DIR

 

Results: This script will print the results of the evaluation.

 

Reproduce Figures 6, 7, and 10: We will now reproduce the numbers in Figures 6, 7, and 10. The following commands execute KnightKing, the existing GNNs, and the sampling applications in NextDoor, and produce the numbers in tabular format. There are two scripts to reproduce these results.

  1. The first script is runGNNSampling.py. It executes all GNNs and stores the results. It can take up to 2 hours and requires up to 40 GB of RAM for the performance evaluation on the large graphs (Orkut, Patents, and LiveJournal); evaluation on the small graphs (PPI and Reddit) takes up to 10 minutes. To use only the small graphs, pass the -useSmallGraphs True command line argument (see the example after the commands below).

cd $NEXTDOOR_EXP_DIR

python3 runGNNSampling.py -nextdoor $NEXTDOOR_DIR
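
For example, to restrict the GNN evaluation to the small graphs only:

cd $NEXTDOOR_EXP_DIR
python3 runGNNSampling.py -nextdoor $NEXTDOOR_DIR -useSmallGraphs True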

 

  2. The second script is runBenchmarks.py, which executes KnightKing, NextDoor on a single GPU, and NextDoor on multiple GPUs. By default NextDoor executes on the GPU with ID 0; to use multiple GPUs, pass the -gpus argument listing all GPUs. This script takes up to 30 minutes. The following is an example invocation:

 

cd $NEXTDOOR_DIR

make -j

cd $NEXTDOOR_EXP_DIR

python3 runBenchmarks.py -knightKing $TOP_DIR/KnightKing -nextdoor $NEXTDOOR_DIR -gpus 0,1,2,3

 

Note that the above command assumes 4 GPUs with IDs 0, 1, 2, and 3. To use a different number of GPUs, modify this list. If only one GPU is specified, then multi-GPU results are not collected.
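
For example, to collect only single-GPU numbers:

cd $NEXTDOOR_EXP_DIR
python3 runBenchmarks.py -knightKing $TOP_DIR/KnightKing -nextdoor $NEXTDOOR_DIR -gpus 0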

 

Reprint Previous Results: Each invocation of this script stores the results in the benchmarkResults.json file in the nextdoor-experiments directory. To read back and print these results, use the optional flag -printExistingResults true.
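
For example (reusing the same paths as above):

cd $NEXTDOOR_EXP_DIR
python3 runBenchmarks.py -knightKing $TOP_DIR/KnightKing -nextdoor $NEXTDOOR_DIR -printExistingResults true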

 

Results: runBenchmarks.py will produce results for Figures 7a, 7b, and 7c, Figure 6, and Figure 10. If -useSmallGraphs True was used in runGNNSampling.py, then the results of Figure 7b will be produced only for the PPI and Reddit graphs. On larger graphs the speedup of NextDoor over the existing GNNs is even larger.

 

Reproduce Figure 8 and Table 4: We will now reproduce the numbers in Figure 8 and Table 4. These results cover four performance metrics. To obtain them we will use the script runPerfAnalysis.py in $NEXTDOOR_EXP_DIR.

The script requires the path to nvprof, which is usually available at /usr/local/cuda/bin/nvprof. The script needs sudo access, because nvprof cannot profile kernels without it. To obtain the Figure 8 results for L2 cache transactions, execute the command below:

 

cd $NEXTDOOR_EXP_DIR

python3 runPerfAnalysis.py -nextdoor $NEXTDOOR_DIR -nvprof /usr/local/cuda/bin/nvprof -metric l2_read_transactions

 

To obtain the Table 4 results for Global Store Efficiency, execute the commands below:

cd $NEXTDOOR_EXP_DIR

python3 runPerfAnalysis.py -nextdoor $NEXTDOOR_DIR -nvprof /usr/local/cuda/bin/nvprof -metric gst_efficiency

 

To obtain the Table 4 results for Multiprocessor Activity, execute the commands below:

cd $NEXTDOOR_EXP_DIR

python3 runPerfAnalysis.py -nextdoor $NEXTDOOR_DIR -nvprof /usr/local/cuda/bin/nvprof -metric sm_efficiency