A Study in Parallel Algorithms : Stream Compaction
Many of the algorithms you have learned thus far in your career have typically been developed from a serial standpoint. When it comes to GPUs, we are mainly looking at massively parallel work. Thus, it is necessary to reorient our thinking. In this project, we will be implementing a couple different versions of prefix sum. We will start with a simple single thread serial CPU version, and then move to a naive GPU version. Each part of this homework is meant to follow the logic of the previous parts, so please do not do this homework out of order.
This project will serve as a stream compaction library that you may use (and will want to use) in your future projects. For that reason, we suggest you create proper header and CUDA files so that you can reuse this code later. You may want to create a separate cpp file that contains your main function so that you can test the code you write.
Stream compaction is broken down into two parts: (1) scan, and (2) scatter.
Scan or prefix sum is the summation of the elements in an array such that the resulting array is the summation of the terms before it. Prefix sum can either be inclusive, meaning the current term is a summation of all the elements before it and itself, or exclusive, meaning the current term is a summation of all elements before it excluding itself.
Inclusive:
In : [ 3 4 6 7 9 10 ]
Out : [ 3 7 13 20 29 39 ]
Exclusive
In : [ 3 4 6 7 9 10 ]
Out : [ 0 3 7 13 20 29 ]
Note that the resulting prefix sum will always be n + 1 elements if the input array is of length n. Similarly, the first element of the exclusive prefix sum will always be 0. In the following sections, all references to prefix sum will be to the exclusive version of prefix sum.
The scatter section of stream compaction takes the results of the previous scan in order to reorder the elements to form a compact array.
For example, let's say we have the following array: [ 0 0 3 4 0 6 6 7 0 1 ]
We would only like to consider the non-zero elements in this zero, so we would like to compact it into the following array: [ 3 4 6 6 7 1 ]
We can perform a transform on input array to transform it into a boolean array:
In : [ 0 0 3 4 0 6 6 7 0 1 ]
Out : [ 0 0 1 1 0 1 1 1 0 1 ]
Performing a scan on the output, we get the following array :
In : [ 0 0 1 1 0 1 1 1 0 1 ]
Out : [ 0 0 0 1 2 2 3 4 5 5 ]
Notice that the output array produces a corresponding index array that we can use to create the resulting array for stream compaction.
Given the definition of exclusive prefix sum, please write a serial CPU version of prefix sum. You may write this in the cpp file to separate this from the CUDA code you will be writing in your .cu file.
We will now parallelize this the previous section's code. Recall from lecture that we can parallelize this using a series of kernel calls. In this portion, you are NOT allowed to use shared memory.
- Compare this version to the serial version of exclusive prefix scan. Please include a table of how the runtimes compare on different lengths of arrays.
- Plot a graph of the comparison and write a short explanation of the phenomenon you see here.
-
My gpu accelerated algorithm actually runs much slower than my sequential cpu algorithm. I think this is due to a poor implementation of the naive algorithm on my part. The way my naive algorithm works is to so a segmented scan and save the largest value in each segment to an auxiliary array. I then call scan recursively on the auxiliary array. My fatal mistake was that I allocated the memory for the auxiliary array within the naive_scan call. The result of this is that for an array of length N i have log(n) calls to malloc which is pretty slow. The sequential version already has everything in memory so it runs much more quickly. I'm pretty ceratin if I were to initialize all the memory outside the call I would see significant performance improvement and I plan to do this before using the algorithm in the next assignment.
-
There are also some other places I could get some speedup, for example in my kernel call I actually move data between my two temporary arrays inside the loop when I could just swap their pointers.
-
My graph is in string compaction graph.pdf
In the previous section we did not take into account shared memory. In the previous section, we kept everything in global memory, which is much slower than shared memory.
Shared memory is accessible to threads of a block. Please write a version of prefix sum that works on a single block.
Taking the previous portion, please write a version that generalizes prefix sum to arbitrary length arrays, this includes arrays that will not fit on one block.
- Compare this version to the parallel prefix sum using global memory.
- Plot a graph of the comparison and write a short explanation of the phenomenon you see here.
- I do see some speedup, however my implementation is still far worse than CPU. This is again due to having allocation inside my recursive call (which was a stupid idea). I plan to fix this before using this algorithm.
- I did see performance improvement in this version over the last one. This is because the loop within my kernel calls now only has to go to shared memory instead of out to global memory.
- My graph is in string compaction graph.pdf
First create a serial version of scatter by expanding the serial version of prefix sum. Then create a GPU version of scatter. Combine the function call such that, given an array, you can call stream compact and it will compact the array for you. Finally, write a version using thrust.
- Compare your version of stream compact to your version using thrust. How do they compare? How might you optimize yours more, or how might thrust's stream compact be optimized.
- Unfortunately I spent many hours debugging and was not able to implement a version in thrust, however I'm quite certain their implementation is superior to mine in every way. They probably have done all sorts of tricks to squeeze out extra performance such as doing a work-efficient algorithm without bank conflicts. They also probably packed more data into registers to do more work per thread.
For extra credit, please optimize your prefix sum for work parallelism and to deal with bank conflicts. Information on this can be found in the GPU Gems chapter listed in the references.
Please answer all the questions in each of the subsections above and write your answers in the README by overwriting the README file. In future projects, we expect your analysis to be similar to the one we have led you through in this project. Like other projects, please open a pull request and email Harmony.
"Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3.