This repository contains CUDA code for performing matrix-vector multiplication using row-wise decomposition. The CUDA kernel launches multiple threads to efficiently compute the result in parallel on a GPU.
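The row-wise decomposition can be sketched as a kernel in which each thread computes one row of the output vector. This is an illustrative sketch only; the kernel and parameter names below are assumptions, not the repository's actual code:

```cuda
// Minimal sketch of row-wise matrix-vector multiplication (y = A * x).
// Each thread owns one row of A and accumulates its dot product with x.
// Names (matVecMul, A, x, y, n) are illustrative assumptions.
__global__ void matVecMul(const float *A, const float *x, float *y, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {                       // guard: grid may overshoot n rows
        float sum = 0.0f;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];  // row-major indexing
        y[row] = sum;
    }
}
```

Because each thread reads a full row and writes a single element, no inter-thread synchronization is needed.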
The code measures the speedup of the parallel matrix-vector multiplication over a sequential baseline for varying thread configurations and prints the speedup for each one. The tests were run with threads_per_block ranging from 32 to 640 (32 × 20), in steps of 32.
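The sweep over thread configurations could be timed with CUDA events along these lines (a sketch under assumed names; `d_A`, `d_x`, `d_y` are device buffers and `serial_ms` is the measured sequential CPU time, none of which are taken from the repository):

```cuda
// Sketch: time each launch configuration and report speedup vs. serial.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int tpb = 32; tpb <= 32 * 20; tpb += 32) {
    int blocks = (n + tpb - 1) / tpb;   // enough blocks to cover all n rows
    cudaEventRecord(start);
    matVecMul<<<blocks, tpb>>>(d_A, d_x, d_y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);         // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("threads_per_block=%d  speedup=%.2f\n", tpb, serial_ms / ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

CUDA events time the kernel on the GPU itself, avoiding the host-side overhead that a CPU timer would include.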
Speed up vs number of threads per block:
The speedup clearly decreases as the number of threads per block increases.
The code was run on the wes-00-00 GPU node of Wesley.
- NVIDIA GPU with CUDA support
- CUDA Toolkit installed
- C compiler (e.g., GCC) for compiling host code
- `sshpass` utility for password-based SSH authentication (install using `sudo apt-get install sshpass`)
Compile the code using the provided Makefile, or directly with nvcc:

```
nvcc -g -G script.cu -o matrix_mult
```

This will generate an executable named `matrix_mult`.
Run the executable, providing the number of threads per block as an argument:

```
./matrix_mult <threadsnum>
```

Replace `<threadsnum>` with the desired number of threads per block. The program will measure the parallel execution time and print the speedup for each thread configuration.
To run the program with 32 threads per block:

```
./matrix_mult 32
```
To submit a job in an interactive session:

```
qsub -I -l host=wes-00-00
```
A bash script is provided that automates compiling and executing the CUDA code on the specified remote server with the given source file and input value `n`.
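The automation could look roughly like the following. This is a hedged sketch, not the repository's actual script: the server address, file paths, and the `PASSWORD` variable are placeholders you must adapt to your environment.

```shell
#!/bin/bash
# Hypothetical automation sketch: copy the source to the remote GPU node,
# compile it there, and run it with the given thread count.
# REMOTE and PASSWORD are placeholders, not real credentials.
set -e

N="$1"                         # threads per block, passed on the command line
REMOTE="user@wesley.example"   # placeholder server address

sshpass -p "$PASSWORD" scp script.cu "$REMOTE:~/"
sshpass -p "$PASSWORD" ssh "$REMOTE" \
    "nvcc -g -G script.cu -o matrix_mult && ./matrix_mult $N"
```

Storing a password in an environment variable is convenient for coursework clusters but not recommended in general; key-based SSH authentication avoids `sshpass` entirely.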
Contributions are welcome! Feel free to open an issue or submit a pull request for any improvements or bug fixes.
Be sure to provide clear instructions on how to use the script and what parameters it expects. Adjust the file paths and server credentials in the script to match your environment.