nestedloopsfusion

LLVM branching optimization transformation pass for GPUs

To compile the Loops Fusion transformation pass, you need a working and up-to-date build of LLVM/Clang.
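For example, the sources can be checked out like this (a sketch; the llvmorg-9.0.0 tag is an assumption that matches the LLVM 9.0 version used below, any checkout of a recent release works the same way):

git clone https://github.com/llvm/llvm-project.git
cd llvm-project
git checkout llvmorg-9.0.0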

Build LLVM

Assume you built Clang for a per-user install, using a CMake configuration like this (instructions valid for LLVM 9.0):

cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=debug -DLLVM_ENABLE_PROJECTS="clang;llvm;clang-tools-extra;compiler-rt" -DCMAKE_INSTALL_PREFIX=/home/username/local -DLLVM_TARGETS_TO_BUILD="AMDGPU;NVPTX;X86;WebAssembly" ../llvm-project/llvm/
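After configuring, build and install LLVM into the chosen prefix (a sketch; pick a parallelism level that fits your machine):

make -j$(nproc)
make install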

Build LoopF transformation pass module

After that, the pass can be compiled as follows (assuming you're at the root of this repository):

mkdir build
cd build
env CC=clang cmake -DCMAKE_PREFIX_PATH=/home/username/local -DCMAKE_INSTALL_PREFIX=/home/username/local ../
make

After that, the LoopF LLVM pass becomes available as a plugin module for the opt utility.
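To check that the plugin loads, you can ask opt to list the registered pass, a quick sketch that assumes you are still in the build directory and that the plugin ended up under LoopF/:

opt -load LoopF/libLoopF.so -help 2>&1 | grep -i loopf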

Build cudatest

Assuming you already have CUDA installed, the cudatest benchmark can be compiled like this:

clang++ cudatest.cu  -L/usr/local/cuda/lib64/ -I/usr/local/cuda/samples/common/inc -lcudart_static -ldl -lrt -pthread    -o cudatest_ref --cuda-gpu-arch=sm_70

To test it out, run cudatest_ref 1 1234 3000 31 2. It should print something like:

 ...
 356379
 Time Sum Avg Avgt/elem 1.949342 15047055 470220 241219.915248

Applying transformation pass to the benchmark

To build the transformed version of cudatest, you first have to create a build script based on Clang's compilation process. To do that, run the compilation command with the -### option, which tells Clang to print the compilation commands it is about to run instead of running them.

clang++ cudatest.cu  -L/usr/local/cuda/lib64/ -I/usr/local/cuda/samples/common/inc -lcudart_static -ldl -lrt -pthread    -o cudatest_transformed --cuda-gpu-arch=sm_70 -O3 -save-temps -### 2> ./make_transformed.sh

Now open make_transformed.sh. It will look like this:

clang version 9.0.0 (https://github.com/llvm/llvm-project.git 635b988578505eee09ff304974bc2a72becb66d3)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/username/local/bin
"/home/username/local/bin/clang-9" "-cc1" ..
"/home/username/local/bin/clang-9" "-cc1" ..
"/home/username/local/bin/clang-9" "-cc1" ..
"/usr/local/cuda-10.1/bin/ptxas" "-m64" ..
"/usr/local/cuda-10.1/bin/fatbinary" "--cuda" ..
"/home/username/local/bin/clang-9" "-cc1" ..
"/home/username/local/bin/clang-9" "-cc1" ..
"/home/username/local/bin/clang-9" "-cc1" ..
"/home/username/local/bin/clang-9" "-cc1as" ..
"/usr/bin/ld" "-z" "relro" "--hash-style=gnu" ..

Now remove the " symbols from this file, drop the informational header lines, and turn it into a bash script, something like this (a sed sketch for this cleanup follows the listing):

#!/bin/bash
/home/username/local/bin/clang-9 -cc1 .. 
/home/username/local/bin/clang-9 -cc1 .. 
/home/username/local/bin/clang-9 -cc1 .. 
/usr/local/cuda-10.1/bin/ptxas -m64 ..
/usr/local/cuda-10.1/bin/fatbinary --cuda ..
/home/username/local/bin/clang-9 -cc1 ..
/home/username/local/bin/clang-9 -cc1 ..
/home/username/local/bin/clang-9 -cc1 ..
/home/username/local/bin/clang-9 -cc1as ..
/usr/bin/ld -z relro --hash-style=gnu ..
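One way to do this cleanup automatically, as a sketch that assumes GNU sed and that the version/target header occupies the first four lines:

sed -i 's/"//g' make_transformed.sh          # strip the quotation marks
sed -i '1,4d' make_transformed.sh            # drop the clang version/target header lines
sed -i '1i #!/bin/bash' make_transformed.sh  # prepend the shebang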

Make it executable with chmod +x make_transformed.sh and run it to check that the compilation script works.

Now, the transformation commands should be injected into this script after the second clang -cc1 invocation (the one that produces the device-side .bc file), like this:

#!/bin/bash
/home/username/local/bin/clang-9 -cc1 .. 
/home/username/local/bin/clang-9 -cc1 ..  -o cudatest-cuda-nvptx64-nvidia-cuda-sm_70.bc ..
opt -load ../build/LoopF/libLoopF.so -simplifycfg -loop-rotate -loopf cudatest-cuda-nvptx64-nvidia-cuda-sm_70.bc |opt -O3 > cudatest-cuda-nvptx64-nvidia-cuda-sm_70.bc_mod
mv cudatest-cuda-nvptx64-nvidia-cuda-sm_70.bc cudatest-cuda-nvptx64-nvidia-cuda-sm_70.bc_orig
mv cudatest-cuda-nvptx64-nvidia-cuda-sm_70.bc_mod cudatest-cuda-nvptx64-nvidia-cuda-sm_70.bc
/home/username/local/bin/clang-9 -cc1 .. 
/usr/local/cuda-10.1/bin/ptxas -m64 ..
/usr/local/cuda-10.1/bin/fatbinary --cuda ..
/home/username/local/bin/clang-9 -cc1 ..
/home/username/local/bin/clang-9 -cc1 ..
/home/username/local/bin/clang-9 -cc1 ..
/home/username/local/bin/clang-9 -cc1as ..
/usr/bin/ld -z relro --hash-style=gnu ..

After that, you can run make_transformed.sh to produce cudatest_transformed, which is the same program as cudatest_ref but with its GPU kernel code transformed by the Nested Loops Fusion pass. You can run it with the same parameters as the original, cudatest_transformed 1 1234 3000 31 2, but now it finishes much faster and shows better benchmark values:

...
  356379
 Time Sum Avg Avgt/elem 0.111188 15047055 470220 4229054.838467
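
For a quick side-by-side comparison of the two binaries (a sketch using the same parameters as above):

./make_transformed.sh
./cudatest_ref 1 1234 3000 31 2
./cudatest_transformed 1 1234 3000 31 2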