diff --git a/README.md b/README.md index 0e38ddb1..85051c08 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,190 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Griffin Evans + * gpevans@seas.upenn.edu, [personal website](evanses.com/griffin) +* Tested on lab computer: Windows 11 Education, i9-12900F @ 2.40GHz 64.0GB, NVIDIA GeForce RTX 3090 (Levine 057 #1) -### (TODO: Your README) +# Overview -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +This project is an implementation and demonstration of several implementations of exclusive scan and stream compaction algorithms. Our scan implementations all take in an array of integers `idata` and output to each index of the array `odata` the sum of all of the values in `idata` with lower indices. Our stream compaction functions take in an array of integers `idata` and output to `odata` an array of the same integers but with all elements that equal 0 removed, returning the number of elements in this resulting array. +Four implementations of scan are featured: the first is a CPU-side implementation which sequentially steps through the input array, then we have 3 GPU-based implementations using CUDA. The first GPU-based implementation (the "naïve" implementation) is based on the non-work-efficient algorithm from Example 2 in [chapter 39 of _GPU Gems 3_](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), while the second GPU-based implementation is based on the work-efficient approach from Examples 3 and 4 in the same chapter. The last GPU-based approach simply leverages the existing Thrust library's implementation of exclusive scan. + +Three implementations of stream compaction are featured: the first is a straightforward CPU-based implementation, the second is a CPU-based implementation which divides the process into separate map/scan/scatter steps, and the third is a GPU-based implementation which uses the work-efficient scan implementation alongside GPU-side map and scatter kernels. + +Building the project and running `main()` will run each of the implementations on an array of random values, outputting the results and the time each implementation took to run. + +## Setup Notes: + +I modified stream_compaction/CMakeLists.txt as [described here](https://edstem.org/us/courses/81464/discussion/6937028?answer=16157632); adding `find_package(CCCL REQUIRED)` and `target_link_libraries(stream_compaction CCCL::Thrust)` to fix the thrust library not properly being included. + +The process of building the project was as follows: + +1. Clone the project to a folder and navigate into it in Bash. +2. `mkdir build` +3. `cd build` +4. `cmake-gui ..` +5. Within CMake GUI, configure the project, selecting Visual Studio 17 2022 as the generator and x64 as the platform. +6. Select "generate". +7. Open the generated `.sln` file within the build folder in Visual Studio. +8. Build the project from the toolbar in Visual Studio. + +# Questions and Analysis + +## Performance Plots + +All three plots feature array size on the x-axis, and the total running time of the implementation (disregarding initial/final memory operations) logarithmically on the y-axis. + +![](img/plot1.png) + +For smaller array sizes, the CPU implementation of scan is plenty efficient and outperforms all three of the GPU-based implementations. However, the time the CPU-based implementation takes increases essentially linearlly with array size (note that while the y-axis of the plot is logarithmic, the x-axis is as well due to the use of powers of two such that each data point shown on the chart has 4 times the array size of the point to its left, meaning that a linear correlation between the x and y values will ultimately appear as a straight line), whereas the GPU-based implementations all show a much more gradual increase (in fact largely just fluctuating rather than consistently increasing for the lower array sizes). Because of this slower increase, when we get to the size of $2^{16}$ the Thrust-based implementation becomes slightly faster than the CPU-based implementation (up to that point having been faster than the other GPU-based implementations but still slower than CPU). Oddly though when we increase the array size again to $2^{18}$ the Thrust-based implementation briefly loses its spot as the fastest GPU-based approach, still being marginally faster than the CPU and work-efficient implementations (which are now almost exactly tied in run time) but being slower than the naïve GPU-based implementation, which seems to keep similar performance to what it had in the smaller array sizes. At $2^{20}$ the work-efficient GPU implementation finally beats the CPU implementation in speed, while staying close to the Thrust implementation's performance, but the naïve implementation is still slightly faster than both. The naïve implementation's performance starts significantly degrading at $2^{22}$, becoming slower than both of the other GPU-based implementations, and increasing in runtime essentially linearlly with array size beyond that point. The work-efficient and Thrust-based implementations both seem to remain more gradual in runtime increase for slightly longer, but around $2^{26}$ onwards they seem to show a similar linear trend. + +The consistently linear increase in runtime for the CPU-based approach makes intuitive sense as the number of elements that need to be iterated over in series directly matches the size of the input array. The GPU-based approaches of course all benefit from parallelism such that we see much more minor performance detriments until we reach array sizes large enough that we are severely hampered by hardware limitations (in regards to e.g. the number of warps we can process concurrently), hence the gradual slopes in the left sections of the graph. However the benefits of this parallelism do not seem very significant for the smaller array sizes, perhaps being outweighed by the overhead of the more complicated structure of the algorithms (requiring more operations even if those operations are performed in parallel, and needing additional values to be tracked for e.g. the varying maximum index used in the work-efficient approach, or even needing memory allocation to be performed in the Thrust-based implementation—see Nsight Analysis section below), and hence the GPU approaches are slower for the smaller sizes. + +As for why the GPU-based approaches vary so sporadically in regards to comparable performance, particularly around $2^{18}$, I don't have as clear an answer but I suspect is would have something to do with the resources used by the different approaches hitting particular thresholds at different times. For instance while the naïve implementation has the downside of consistently launching the same number of blocks (and thus warps/threads) in each step, whereas the work-efficient approach launches fewer blocks in the later up-sweeps and earlier down-sweeps, up to a certain number of elements this may not actually make a significant difference as we may have SMs capable of running a large portion of those blocks/warps/threads all concurrently, meaning it's only once the array size increases further (and hence further increases the number of threads needed) that we see the hardware limits come into affect, and since the size is consistently that large throughout the naïve process's execution it becomes much more drastically affected in terms of performance by those limits. + +![](img/plot2.png) + +While the overall trend remains similar when looking at non-power-of-two array sizes (here each three less than a power of two), the testing here shows some odd spikes in performance. For instance, Thrust performs significantly better at $2^{16}-3$ than at $2^{16}$, but then jumps up to around the same run time at $2^{18}-3$ as at $2^{18}$, while the naïve and work-efficient approaches seem to consistently peform better on the smaller sizes up until around $2^{22}-3$, in fact leading to the work-efficient approach outperforming thrust for $2^{18}-3$ and $2^{20}-3$ despite never beating it in the power-of-two size tests. This difference seems especially odd for the work-efficient approach given that it pads the size of the array used to the next power of two, filling the last few values with 0s. This seems to suggest that the GPU's operations might be more performant on values of 0 specifically, and hence effectively replacing the last few values of the array with all 0s means the algorithm runs faster despite effectively having the same length of input. + +![](img/plot3.png) + +While we do not test it with as many GPU-based implementations, we can see similar trends in stream compaction performance as compared to scan. The CPU-based approach that splits it into map/scan/scatter steps performs consistently worse than the more natural sequential implementation, of course as the splitting causes additional work to be done without conferring particular benefit to the running of the CPU-side code. The GPU-based approach, which similarly splits up the steps but performs each of them using CUDA kernels, again performs worse than the CPU approaches for small array sizes, but as the array size increases to around $2^{18}$ it keeps a steady performance while the CPU versions become slower and slower, only starting to increase in run time at a similar rate relative to array size around $2^{24}$, still staying signficantly faster than the CPU versions. + +## Work-efficient Parallel Scan Implementation Details (re Part 5) + +In order to ensure the work-efficient implementation of scan was performant, a few considerations were made. Firstly note that in both the up-sweep and down-sweep kernels, in order to set the $k$ value (used in the indices of elements to be summed) which ranges from $0$ to $n-1$ (where $n$ is the number of elements in the padded array) in steps of $2^{d+1}$, we take the index of the thread (calculated from `threadIdx`, `blockIdx`, and `blockDim`), compare it to a maximum value and if it is greater than or equal to that value then we return, and if it isn't then we scale that index by multiplying it by $2^{d+1}$ and use that as our $k$. Note that the maximum needs to be compared before the multiplication is done, as otherwise with large enough array sizes we may have our $k$ value overflow and wrap around to a negative value. + +For the up-sweep, since the step size $2^{d+1}$ doubles in each iteration, we halve this maximum value each time, starting at $n/2$ as the first step size is $2^{0+1} = 2$. This means in each iteration we have half the number of threads which need to perform work, and the rest of the threads can terminate early upon reaching that maximum index check. Because all of the threads that have work required are sequential in index (their indices starting at 0 and going up by 1 to our maximum index at that step) we should have at most one warp which will have some threads working while others try to terminate early, while all other warps either have all threads doing work or all threads terminating early. In fact since we know the higher-index threads will not have to do any work, we can reduce the number of blocks launched as we reduce this maximum index, such that we completely skip any blocks that would have no threads which actually perform work. + +For the down-sweep, as $d$ is now descending and $2^{d+1}$ halves each time, we instead double the maximum value each time, such that the number of threads needing to do work doubles each time, and we similarly increase the number of blocks launched as the number of threads needed increases. + +An additional detail which was tested but ultimately is unused in this implementation was the reduction of size of the block once the number of threads needed to launch was low enough that we only had a single block. That is, when the number of threads that will do work is less than the default block size, the kernel would launch with a single block with a number of threads equal to the number of summations that will be done (also tested with that value rounded up to the next multiple of 32, to correspond with the size of warps). This however did not appear to cause any noticeable change in performance, and in testing various default block sizes it was found that starting all of the blocks with 32 threads (the size of a single warp) was most performant for running this implementation of scan on the hardware used—so since the blocks were single-warp-sized anyway (meaning there'd be no particular benefit to shrinking a block further, since 32 threads would make up the warp anyway), it was not necessary to use this size reduction behavior. + +Note that the block size of 32 for work-efficient scan was in constrast to the block sizes found to work best for the naïve implementation of scan and for the other steps of the GPU stream compaction implementation, which all seemed to perform the best with 512 threads per block. + + +## Nsight Analysis + +![](img/Overview1.png) + +![](img/Overview2.png) + +Looking at the timeline for our Thrust-based implementation in NSight Systems, we see a few main periods of activity that seem to be associated with the calls to the Thrust library—firstly corresponding to the construction of the device vectors, then to the scan function itself, then to the copy operation sending the resulting data to the CPU side. + +![](img/Zoom1b.png) + +This section seems to correspond to the construction of the device vectors. We construct two device vectors: `dv_in` which is populated with the elements from `idata` and `dv_out` which is a vector of the same length which will be filled with the results from the scan. We can see in the CUDA API row of Nsight systems a cudaMalloc call followed by a cudaMemcpyAsync call, which seems to correspond to the `dv_in` vector being created and the values from `idata` being copied into it from the CPU. We then have a cudaStreamSynchronize call followed by another cudaMalloc, this time not followed by a copy call as the `dv_out` vector is created with memory allocated to it but no values to be copied into it yet. + +![](img/Zoom1.png) + +That cudaMalloc precedes another cudaStreamSynchronize call, which seems to suggest it is synching to ensure the vectors are done being created before they are used. + +![](img/Zoom2b.png) + +We then see cudaEventCreate and cudaEventRecord calls, which match the usage of `timer().startGpuTimer()`, which creates an event to be later used to determine the run time of the function. This is followed by where we call the `thrust::exclusive_scan` function, which seems to correspond to a cudaMalloc call followed by the calling of DeviceScanInitKernel and DeviceScanKernel. + +![](img/Zoom2.png) + +After a cudaStreamSynchronize we have a cudaFree call before another cudaEventRecord call and a cudaEventSynchronize call, the latter two matching the calling of `timer().endGpuTimer()` and hence letting us see that all of the activity happening between this and the earlier cudaEventRecord is that of the `thrust::exclusive_scan` call. The presence of cudaMalloc and cudaFree within this span seems to suggest the execution of exclusive_scan involves allocating some sort of temporary memory buffer separate from the input and output vectors, unlike our implementations of scan which perform all of their operations directly between the input and output arrays in the case of the naïve implementation and acts in-place in the work-efficient implementation. I would assume this implementation uses shared memory to some degree, as that is a potential improvement which my work-efficient implementation does not account for, but I assume this memory usage here is something different as I believe shared memory usage wouldn't show up as a cudaMalloc call, seeing as `cudaMalloc()` is not the syntax used to allocate it and we allocate it at the launch of the kernel. That's not to say this is a sign the implementation doesn't use shared memory though (I would still guess that it does), but it does seem to imply there is also some other global memory usage in addition to that of the memory occupied by the two device vectors. + +![](img/Zoom2c.png) + +![](img/Zoom2d.png) + +The above two images show the execution of the aforementioned DeviceScanInitKernel and DeviceScanKernel, from which we can see that most of the execution time is taken up by the latter. Looking at the SM Warp Occupancy over time, the Compute Warps in Flight throughput percentage is 2% for most of the execution of DeviceScanInitKernel but increases to 82% for most of the DeviceScanKernel's execution—one can potentially guess that there is some sort of processing done in the "init" kernel which is less parallelizable, and hence we have this kernel do that initial work before starting another kernel which performs more parallelized work with much higher warp occupancy after. + +![](img/Zoom3.png) + +After the event recording is done and we are hence past our `timer().endGpuTimer()` call, we have a final cudaMemcpyAsync for the `thrust::copy` call which takes the data from the `dv_out` device vector and copies it into the CPU-side `odata` which is used to output the overall values from our `scan` function. After another cudaStreamSynchronize (ensuring the copying has finished) we have two cudaFree calls, which seem to be automatically performed by Thrust as it sees that the device vectors are now going to be out of scope and hence can be cleared from memory (compare the usage of something like raw pointers and `std::unique_ptr`—we don't need to explicitly write out the freeing of memory as we did in our other GPU-side scan implementations as the library does that for us). + +![](img/Screenshot%202025-09-15%20155306.png) + +For all of the implementations, initial and final memory operations (primarily calls to cudaMemcpy) take take longer than the execution of the algorithm itself, hence being left out of our timing comparisons for the algorithms themselves. + +![Naive implementation kernels](img/Naive1.png) + +As the naïve implementation here does not vary in number of threads between steps (with threads that aren't summing just copying values from idata to odata, rather than terminating early or not starting at all; the only threads that terminate early are those which have index greater than the total count of elements, i.e. only a few threads in one block and only any if the array size is non-power-of-two), the kernel used by it has a fairly consistent length of execution. + +![Work-efficient implementation kernels](img/WorkEfficient1.png) + +For the work-efficient implementation, the kernels vary in length of execution significantly likely because we both have more threads able to terminate early and we only launch blocks of threads up to the amount required for the work to actually be done in that step. The first calling of the up-sweep kernel takes the longest to complete, then each subsequent step becomes shorter as we reduce the number of threads needed, and inversely the down-sweep kernel starts the shortest then gets longer as we increase the number of threads used. + +![](img/ComputeWE1.png) + +![](img/ComputeWE2.png) + +Note though that because we end up launching very few blocks for many of the steps of the work-efficient implementation, we massively underutilize the GPU's multiprocessors (parallelization doesn't help much when e.g. performing an operation on just one value—using only a single thread is not really leveraging the GPU's abilities). Hence in the data from Nsight Compute as shown above we can see that though the Memory Throughput ranges from 67.42% to 93.27% for the first few calls of the up-sweep kernel, it begins to drop heavily as we progress, reaching 0.58% at the lowest, while for the down-sweep kernel it starts low (as we start with only a single block with a single thread that has work to do) at 0.98% then gradually increases. Compute Throughput appears relatively low throughout the process, having a max of 12.59% throughout the up and down sweep and being significantly lower at the same times the Memory Throughput is. + +![](img/ComputeNaive1.png) + +The naïve implementation has high Memory Throughput throughout, averaging 92.67%, while the Compute Throughput ranges from 15.27% to 23.30%. + +![](img/ComputeThrust1.png) + +The Thrust implementation has lower Compute and Memory Throughputs in its DeviceScanInitKernel then has high values in each for the DeviceScanKernel—here having 41.47% and 41.76% Compute Throughput and 91.37% and 91.67% Memory Throughput for executions on arrays of size $2^{29}$ and $2^{29}-3$ respectively. While the Compute Throughput is still much less than the Memory Throughput, it is far greater than in either of our implementations. + +Given all of these implementations have consistently higher Memory Throughput than Compute Throughput, I would suppose memory to be the main bottleneck—though in the case of the low-threadcount sections of the work-efficient approach, the GPU's resources are being underutilized to an extent where this is not so relevant to the overall performance (though for the first few up-sweeps and last few down-sweeps the performance still appears memory-bound). + +## Program Output: + +I didn't add any additional tests to the output here, but I did test compact with arrays not ending in 0, whereas the base test always ended them with 0—this was significant as the use of exclusive scan means the final value in the indices array is equal to the total count of non-zero elements if the final element is 0, but otherwise is 1 less than the total count, so I needed to make sure my implementation handled this case correctly (but since it doesn't particularly affect performance, I did not take separate metrics for it for it). + +With array size 2^24: +``` +Array Size 2^24 + +**************** +** SCAN TESTS ** +**************** + [ 33 14 48 6 37 31 38 47 49 45 1 8 38 ... 36 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 26.9298ms (std::chrono Measured) + [ 0 33 47 95 101 138 169 207 254 303 348 349 357 ... 410991594 410991630 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 26.3172ms (std::chrono Measured) + [ 0 33 47 95 101 138 169 207 254 303 348 349 357 ... 410991573 410991580 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 4.8169ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 4.69709ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 2.28694ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 2.09229ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.608256ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.741376ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 1 2 0 2 3 0 2 2 2 1 1 2 3 ... 3 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 31.3233ms (std::chrono Measured) + [ 1 2 2 3 2 2 2 1 1 2 3 1 1 ... 1 3 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 32.9821ms (std::chrono Measured) + [ 1 2 2 3 2 2 2 1 1 2 3 1 1 ... 1 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 65.5021ms (std::chrono Measured) + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 2.65523ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 2.58253ms (CUDA Measured) + passed +Press any key to continue . . . +``` diff --git a/img/ComputeNaive1.png b/img/ComputeNaive1.png new file mode 100644 index 00000000..af14d707 Binary files /dev/null and b/img/ComputeNaive1.png differ diff --git a/img/ComputeOtherPartsOfCompact.png b/img/ComputeOtherPartsOfCompact.png new file mode 100644 index 00000000..6566e295 Binary files /dev/null and b/img/ComputeOtherPartsOfCompact.png differ diff --git a/img/ComputeThrust1.png b/img/ComputeThrust1.png new file mode 100644 index 00000000..d5540f62 Binary files /dev/null and b/img/ComputeThrust1.png differ diff --git a/img/ComputeWE1.png b/img/ComputeWE1.png new file mode 100644 index 00000000..bb4cdc2f Binary files /dev/null and b/img/ComputeWE1.png differ diff --git a/img/ComputeWE2.png b/img/ComputeWE2.png new file mode 100644 index 00000000..7b7561d0 Binary files /dev/null and b/img/ComputeWE2.png differ diff --git a/img/Naive1.png b/img/Naive1.png new file mode 100644 index 00000000..eb3c599c Binary files /dev/null and b/img/Naive1.png differ diff --git a/img/Overview1.png b/img/Overview1.png new file mode 100644 index 00000000..b54654a9 Binary files /dev/null and b/img/Overview1.png differ diff --git a/img/Overview2.png b/img/Overview2.png new file mode 100644 index 00000000..acc7058d Binary files /dev/null and b/img/Overview2.png differ diff --git a/img/Screenshot 2025-09-15 134650.png b/img/Screenshot 2025-09-15 134650.png new file mode 100644 index 00000000..225f894b Binary files /dev/null and b/img/Screenshot 2025-09-15 134650.png differ diff --git a/img/Screenshot 2025-09-15 155306.png b/img/Screenshot 2025-09-15 155306.png new file mode 100644 index 00000000..66ae22b8 Binary files /dev/null and b/img/Screenshot 2025-09-15 155306.png differ diff --git a/img/WorkEfficient1.png b/img/WorkEfficient1.png new file mode 100644 index 00000000..83294abc Binary files /dev/null and b/img/WorkEfficient1.png differ diff --git a/img/Zoom1.png b/img/Zoom1.png new file mode 100644 index 00000000..02460e83 Binary files /dev/null and b/img/Zoom1.png differ diff --git a/img/Zoom1b.png b/img/Zoom1b.png new file mode 100644 index 00000000..bcdc1776 Binary files /dev/null and b/img/Zoom1b.png differ diff --git a/img/Zoom2.png b/img/Zoom2.png new file mode 100644 index 00000000..bc538dd0 Binary files /dev/null and b/img/Zoom2.png differ diff --git a/img/Zoom2b.png b/img/Zoom2b.png new file mode 100644 index 00000000..847bdcad Binary files /dev/null and b/img/Zoom2b.png differ diff --git a/img/Zoom2c.png b/img/Zoom2c.png new file mode 100644 index 00000000..d35d6303 Binary files /dev/null and b/img/Zoom2c.png differ diff --git a/img/Zoom2d.png b/img/Zoom2d.png new file mode 100644 index 00000000..9c156840 Binary files /dev/null and b/img/Zoom2d.png differ diff --git a/img/Zoom3.png b/img/Zoom3.png new file mode 100644 index 00000000..a883ae87 Binary files /dev/null and b/img/Zoom3.png differ diff --git a/img/plot1.png b/img/plot1.png new file mode 100644 index 00000000..8510a951 Binary files /dev/null and b/img/plot1.png differ diff --git a/img/plot2.png b/img/plot2.png new file mode 100644 index 00000000..c493d006 Binary files /dev/null and b/img/plot2.png differ diff --git a/img/plot3.png b/img/plot3.png new file mode 100644 index 00000000..fd7b374a Binary files /dev/null and b/img/plot3.png differ diff --git a/src/main.cpp b/src/main.cpp index 3d5c8820..710c6ca2 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,28 +13,36 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE_POW = 29; +const int SIZE = 1 << SIZE_POW; // feel free to change the size of array (<-- default 1 << 8) const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; +#define SKIP_CPU 0 +#define SKIP_NON_THRUST 0 + int main(int argc, char* argv[]) { // Scan tests + printf("Array Size 2^%d\n", SIZE_POW); printf("\n"); printf("****************\n"); printf("** SCAN TESTS **\n"); printf("****************\n"); + genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; + //genArray(SIZE, a, 50); printArray(SIZE, a, true); // initialize b using StreamCompaction::CPU::scan you implement // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. // At first all cases passed because b && c are all zeroes. zeroArray(SIZE, b); +#if !SKIP_CPU && !SKIP_NON_THRUST printDesc("cpu scan, power-of-two"); StreamCompaction::CPU::scan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); @@ -53,13 +61,13 @@ int main(int argc, char* argv[]) { printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - +#endif /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan onesArray(SIZE, c); printDesc("1s array for finding bugs"); StreamCompaction::Naive::scan(SIZE, c, a); printArray(SIZE, c, true); */ - +#if !SKIP_NON_THRUST zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); @@ -74,20 +82,25 @@ int main(int argc, char* argv[]) { //printArray(SIZE, c, true); printCmpResult(SIZE, b, c); + zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); - +#endif zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); +#if SKIP_NON_THRUST + printArray(SIZE, c, true); +#else printCmpResult(SIZE, b, c); +#endif +#if !SKIP_NON_THRUST zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); @@ -102,8 +115,9 @@ int main(int argc, char* argv[]) { // Compaction tests - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; + genArray(SIZE, a, 4); + //genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case + //a[SIZE - 1] = 0; printArray(SIZE, a, true); int count, expectedCount, expectedNPOT; @@ -111,6 +125,7 @@ int main(int argc, char* argv[]) { // initialize b using StreamCompaction::CPU::compactWithoutScan you implement // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. zeroArray(SIZE, b); +#if !SKIP_CPU printDesc("cpu compact without scan, power-of-two"); count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); @@ -130,9 +145,9 @@ int main(int argc, char* argv[]) { printDesc("cpu compact with scan"); count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); + //printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); - +#endif zeroArray(SIZE, c); printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); @@ -146,7 +161,7 @@ int main(int argc, char* argv[]) { printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); - +#endif system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index 19511caa..1646ccb3 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -19,8 +19,9 @@ list(SORT sources) source_group(Headers FILES ${headers}) source_group(Sources FILES ${sources}) - +find_package(CCCL REQUIRED) add_library(stream_compaction ${sources} ${headers}) +target_link_libraries(stream_compaction CCCL::Thrust) if(CMAKE_VERSION VERSION_LESS "3.23.0") set_target_properties(stream_compaction} PROPERTIES CUDA_ARCHITECTURES OFF) elseif(CMAKE_VERSION VERSION_LESS "3.24.0") diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d630..c39cca3b 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,11 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + bools[index] = (idata[index] != 0); } /** @@ -33,6 +38,14 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + if (!bools[index]) { + return; + } + odata[indices[index]] = idata[index]; } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed9..29cddac0 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -13,6 +13,8 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +//const int testBlockSize = 64; + /** * Check for CUDA errors; print and exit if there was a problem. */ diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa115..12fc3264 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -3,6 +3,8 @@ #include "common.h" +#include + namespace StreamCompaction { namespace CPU { using StreamCompaction::Common::PerformanceTimer; @@ -19,7 +21,12 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + if (n > 0) { + odata[0] = 0; + for (int i = 1; i < n; ++i) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + } timer().endCpuTimer(); } @@ -31,8 +38,15 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int count = 0; + for (int i = 0; i < n; ++i) { + if (idata[i] != 0) { + odata[count++] = idata[i]; + } + } timer().endCpuTimer(); - return -1; + //return -1; + return count; } /** @@ -41,10 +55,46 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + if (n == 0) { + return 0; + } + std::vector b(n); + std::vector scanOut(n); + + // Initially used: + //int* b = new int[n]; + //int* scanOut = new int[n]; + // (with freeing them later) + // probably fine either way; I'm swapping to std::vector since that's my preference in my own code writing + timer().startCpuTimer(); // TODO + + // map + for (int i = 0; i < n; ++i) { + b[i] = idata[i] != 0 ? 1 : 0; + } + + // scan + /*scan(n, scanOut.data(), b.data());*/ // <- can't just invoke since conflict with timer, so just copying over + scanOut[0] = 0; + for (int i = 1; i < n; ++i) { + scanOut[i] = scanOut[i - 1] + b[i - 1]; + } + + // scatter + for (int i = 0; i < n; ++i) { + if (b[i]) { // or could use (b[i] == 1) but here should be fine + odata[scanOut[i]] = idata[i]; + } + } + + // if last elment 1, add 1 to count since scan is exclusive + int count = scanOut[n - 1] + b[n-1]; + timer().endCpuTimer(); - return -1; + + return count; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346ee..5263efac 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,6 +3,10 @@ #include "common.h" #include "efficient.h" + +#define DYNAMIC_BLOCK_SIZE 0 +// tried out an approach which in addition to reducing the # of blocks when the number of threads needed is small would reduce the size of the one remaining block when the amount of threads is less than that needs, but doesn't seem like it particularly benefits. Leaving in as an option but not using + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,13 +16,114 @@ namespace StreamCompaction { return timer; } + + __global__ void kernUpSweep(int n, int dExpo, int* data/*, const int* idata*/) { + // TODO probably worth trying out shared memory way--I guess would move loop into here w/ syncs then change external loop to just when needing to cross boundaries + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; // sounds like this way is generally better (more explicit that thread can stop) + } + int d2 = dExpo << 1; + int k = (index) * d2; + data[k + d2 - 1] += data[k + dExpo - 1]; + + } + __global__ void kernDownSweep(int n, int dExpo, int* data) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= n) { + return; + } + int d2 = dExpo << 1; + int k = (index)*d2; + + int t = data[k + dExpo - 1]; + data[k + dExpo - 1] = data[k + d2 - 1]; + data[k + d2 - 1] += t; + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int blockSize = 32; // Seems to perform best here for 32 or 64 but the change is generally pretty minor + int dTarget = ilog2ceil(n); + int pow2Size = 1 << dTarget; + int nCap = pow2Size >> 1; + dim3 fullBlocksPerGrid((nCap + blockSize - 1) / blockSize); + // TODO I'm still not sure if getting fullBlocksPerGrid right 100%, might be overshooting + int* dev_idata; + + // TODO I think want to rewrite into using shared memory way but that's extra credit so I think don't need to + // moving on for now but plan to revisit and do that + // also not sure if I've already done what part 5 is referring to? since I don't do modulo and such and decrease # of threads + + cudaMalloc((void**)&dev_idata, pow2Size * sizeof(int)); + + cudaMemset(dev_idata + n, 0, sizeof(int) * (pow2Size - n)); + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + +#if DYNAMIC_BLOCK_SIZE + int blockSize2 = blockSize; +#endif + timer().startGpuTimer(); + // TODO + + // up-sweep + int dExpo = 1; // = 2^(d) + + for (int d = 0; d < dTarget; ++d) { +#if DYNAMIC_BLOCK_SIZE + kernUpSweep<<>> (nCap, dExpo, dev_idata); +#else + kernUpSweep<<>> (nCap, dExpo, dev_idata); +#endif + if (d < dTarget - 1) { + dExpo <<= 1; + nCap >>= 1; + fullBlocksPerGrid.x = ((nCap + blockSize - 1) / blockSize); // Not sure this is totally the best way to set this but does massively reduce runtime (e.g. ~6ms to ~2ms) + +#if DYNAMIC_BLOCK_SIZE + // could reduce blockSize when gets really low but not sure worth the effort + // doesn't seem to really matter, so just turning off but making an option anyway + if (nCap < blockSize) { + blockSize2 = nCap; + //blockSize2 = ((nCap - 1) / 32 + 1) * 32; + } + else { + blockSize2 = blockSize; + } +#endif + } + } + // set last element to 0 + cudaMemset(dev_idata + (pow2Size - 1), 0, sizeof(int)); + // down-sweep + for (int d = dTarget - 1; d >= 0; --d) { +#if DYNAMIC_BLOCK_SIZE + if (nCap < blockSize) { + // 1 block with block size = # elements used + blockSize2 = nCap; + } + else { + blockSize2 = blockSize; + } + kernDownSweep<<>>(nCap, dExpo, dev_idata); +#else + kernDownSweep<<>>(nCap, dExpo, dev_idata); +#endif + if (d > 0) { + dExpo >>= 1; + nCap <<= 1; + fullBlocksPerGrid.x = ((nCap + blockSize - 1) / blockSize); + } + } + + timer().endGpuTimer(); + cudaMemcpy(odata, dev_idata, sizeof(int) * n, cudaMemcpyDeviceToHost); + cudaFree(dev_idata); } /** @@ -31,10 +136,95 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { + + + //int blockSize = testBlockSize; // TODO optimize + int blockSize = 512; + //512 seems to give best performance out of the values tested, but then scan on its own seems best at 32, so using 32 for scan and 512 for rest--seems to give better performance than same size for all the steps + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + + int* dev_idata; + int* dev_odata; + int* dev_boolArray; + int* dev_indices; + + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + + int scanBlockSize = 32; + int dTarget = ilog2ceil(n); + int pow2Size = 1 << dTarget; + int nCap = pow2Size >> 1; + dim3 fullBlocksPerGridScan((nCap + scanBlockSize - 1) / scanBlockSize); + + cudaMalloc((void**)&dev_boolArray, pow2Size * sizeof(int)); + cudaMalloc((void**)&dev_indices, pow2Size * sizeof(int)); + + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + cudaMemset(dev_odata, 0, sizeof(int) * n); + cudaMemset(dev_boolArray + n, 0, sizeof(int) * (pow2Size - n)); + + + + + timer().startGpuTimer(); // TODO + + // map + Common::kernMapToBoolean<<>>(n, dev_boolArray, dev_idata); + + // scan + cudaMemcpy(dev_indices, dev_boolArray, n * sizeof(int), cudaMemcpyDeviceToDevice); + // up-sweep + int dExpo = 1; // = 2^(d) + + for (int d = 0; d < dTarget; ++d) { + + kernUpSweep << > > (nCap, dExpo, dev_indices); + + if (d < dTarget - 1) { + dExpo <<= 1; + nCap >>= 1; + fullBlocksPerGridScan.x = ((nCap + scanBlockSize - 1) / scanBlockSize); // Not sure this is totally the best way to set this but does massively reduce runtime (e.g. ~6ms to ~2ms) + } + + } + // set last element to 0 + cudaMemset(dev_indices + (pow2Size - 1), 0, sizeof(int)); + // down-sweep + for (int d = dTarget - 1; d >= 0; --d) { + kernDownSweep << > > (nCap, dExpo, dev_indices); + if (d > 0) { + dExpo >>= 1; + nCap <<= 1; + fullBlocksPerGridScan.x = ((nCap + scanBlockSize - 1) / scanBlockSize); + } + } + + + + + // scatter + Common::kernScatter<<>>(n, dev_odata, dev_idata, dev_boolArray, dev_indices); + + + + timer().endGpuTimer(); - return -1; + cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost); + int count; + int countStep; + cudaMemcpy(&count, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&countStep, dev_boolArray + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_boolArray); + cudaFree(dev_indices); + + + return count + countStep; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 43088769..7aaf2347 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -3,6 +3,7 @@ #include "common.h" #include "naive.h" + namespace StreamCompaction { namespace Naive { using StreamCompaction::Common::PerformanceTimer; @@ -13,13 +14,52 @@ namespace StreamCompaction { } // TODO: __global__ + __global__ void kernScanStep(int n, int dExpo, int *odata, const int *idata) { + int k = threadIdx.x + (blockIdx.x * blockDim.x); + if (k >= n) { + return; + } + if (k >= dExpo) { + odata[k] = idata[k - dExpo] + idata[k]; + } + else { + odata[k] = idata[k]; + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int blockSize = 512; + dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize); + int *dev_idata; + int *dev_odata; + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + + cudaMemset(dev_idata, 0, sizeof(int)); + cudaMemcpy(dev_idata + 1, idata, sizeof(int) * (n - 1), cudaMemcpyHostToDevice); // TODO not sure best way to do this, but makes exclusive rather than inclusive + cudaMemset(dev_odata, 0, sizeof(int) * n); // <- I think unnecessary? + + //this seems to perform worse than CPU but I assume that's expected when doing the naive approach since this obviously is inefficient, and I'm only testing on a small dataset anyway rn so the overhead of this outweighs any parallelism benefit + int dTarget = ilog2ceil(n); + int dExpo = 1; // = 2^(d-1) timer().startGpuTimer(); // TODO + for (int d = 1; d <= dTarget; ++d) { + kernScanStep<<>>(n,dExpo,dev_odata,dev_idata); + + if (d < dTarget) { + dExpo *= 2; + std::swap(dev_idata, dev_odata); + } + } timer().endGpuTimer(); + + cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e7..9521b0c3 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,22 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + thrust::host_vector hv_in(idata, idata + n); + thrust::device_vector dv_in = hv_in; + thrust::device_vector dv_out(n); + timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + + // earlier comment: this gives same result but performs way worse than bespoke kernels; dunno if that's how it's expected to be + // ^It's just the case on debug; faster on release. I assume just thrust e.g. keeps more info available for debug hence worse performance + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); timer().endGpuTimer(); + + thrust::copy(dv_out.begin(), dv_out.end(), odata); } } }