CUDA: efficient parallel reduction

CUDA is a powerful API that lets us run highly parallel software on Nvidia GPUs. It is typically used to accelerate specific operations, implemented as kernels, such as matrix multiplication, matrix decomposition, and the training of neural networks.
One such common operation is a reduction: summing up a long array of numbers. A simple serial implementation (running on the CPU) might look like this in C++:

float reduction(float* A, size_t N) {
    float result = 0;
    for (size_t i = 0; i < N; i++) {
        result += A[i];
    }
    return result;
}

Now, if N is very large, this can get quite slow, so we want to parallelize the computation. On a single CPU with, say, 8 threads, this is fairly straightforward: divide the array into 8 parts, have each thread sum up its part, and finally add up the 8 partial results. Classic divide-and-conquer.
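
For example, a minimal sketch of this CPU-side approach with std::thread might look like the following (the function name, thread count, and chunking scheme are illustrative choices of mine, not taken from elsewhere in this post):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

float parallelReduction(const float* A, size_t N, size_t numThreads = 8) {
    std::vector<float> partial(numThreads, 0.0f);
    std::vector<std::thread> workers;
    size_t chunk = (N + numThreads - 1) / numThreads;

    // Each thread sums its own contiguous slice of the array
    for (size_t t = 0; t < numThreads; t++) {
        workers.emplace_back([=, &partial]() {
            size_t begin = t * chunk;
            size_t end   = std::min(N, begin + chunk);
            for (size_t i = begin; i < end; i++)
                partial[t] += A[i];
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-thread partial sums
    float result = 0;
    for (float p : partial) result += p;
    return result;
}
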
On a GPU, however, we have thousands of threads organized into (potentially) hundreds of blocks. If we want maximum performance, we need to be careful about how we divide the workload among blocks and how we combine the partial results.

Mark Harris of Nvidia wrote a well-known deep dive into optimizing a CUDA reduction kernel. Here is a complete implementation, which includes (almost) all of the tricks explained in Harris' slides and generalizes the kernel to a generic type T.

// gpu_reduce.hpp
#ifndef GPU_REDUCE
#define GPU_REDUCE

#include <stdio.h>
#include <stdlib.h>
#include <cmath>

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "CUDA Error: %s: %s.\n", msg, cudaGetErrorString(err) );
        exit(EXIT_FAILURE); 
    }
}

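// Final warp-level stage: threads 0..31 finish the reduction of the last 64
// shared-memory entries with a fully unrolled loop and no __syncthreads().
// The volatile pointer relies on warp-synchronous execution, as in Harris'
// slides; on Volta and newer GPUs explicit __syncwarp() calls may be needed.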
template <size_t blockSize, typename T>
__device__ void warpReduce(volatile T *sdata, size_t tid)
{
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=  8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=  4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=  2) sdata[tid] += sdata[tid +  1];
}

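// One reduction pass: each block accumulates a grid-strided slice of g_idata
// into shared memory, tree-reduces it, and writes its partial sum to
// g_odata[blockIdx.x].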
template <size_t blockSize, typename T>
__global__ void reduceCUDA(T* g_idata, T* g_odata, size_t n)
{
    __shared__ T sdata[blockSize];

    size_t tid = threadIdx.x;
    //size_t i = blockIdx.x*(blockSize*2) + tid;
    //size_t gridSize = blockSize*2*gridDim.x;
    size_t i = blockIdx.x*(blockSize) + tid;
    size_t gridSize = blockSize*gridDim.x;
    sdata[tid] = 0;

    while (i < n) { sdata[tid] += g_idata[i]; i += gridSize; }
    //while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
    __syncthreads();

    if (blockSize >= 1024) { if (tid < 512) { sdata[tid] += sdata[tid + 512]; } __syncthreads(); }
    if (blockSize >=  512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >=  256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >=  128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}


// PRE:
// dA is an array allocated on the GPU
// N <= len(dA), N is a power of two, and N >= blockSize
// POST: returns the sum of the first N elements of dA
template<size_t blockSize, typename T>
T GPUReduction(T* dA, size_t N)
{
    T tot = 0;
    size_t n = N;
    size_t blocksPerGrid = std::ceil((1.*n) / blockSize);

    T* tmp;
    cudaMalloc(&tmp, sizeof(T) * blocksPerGrid); checkCUDAError("Error allocating tmp [GPUReduction]");

    T* from = dA;

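    // Each pass reduces n values down to ceil(n / blockSize) per-block partial
    // sums; repeat until a single final block launch is enough.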
    do
    {
        blocksPerGrid   = std::ceil((1.*n) / blockSize);
        reduceCUDA<blockSize><<<blocksPerGrid, blockSize>>>(from, tmp, n);
        from = tmp;
        n = blocksPerGrid;
    } while (n > blockSize);

    if (n > 1)
        reduceCUDA<blockSize><<<1, blockSize>>>(tmp, tmp, n);

    cudaDeviceSynchronize();
    checkCUDAError("Error launching kernel [GPUReduction]");

    cudaMemcpy(&tot, tmp, sizeof(T), cudaMemcpyDeviceToHost); checkCUDAError("Error copying result [GPUReduction]");
    cudaFree(tmp);
    return tot;
}

#endif

One important thing to note: this function only works if the number of inputs N is a power of two. In some cases this isn't actually an issue, since the array size can be chosen freely. Other times, the problem itself has structure that can be exploited. In general, you can reduce over the largest power of two smaller than N and handle the remainder separately, or pad the end of the array with zeros up to the next power of two, so that the extra values don't affect the result.
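
As a rough sketch of the padding approach (the helper nextPow2 and the wrapper below are mine, not part of the header above), one could allocate a zero-filled device buffer of the next power-of-two size and reduce over that:

// Round n up to the next power of two (illustrative helper)
size_t nextPow2(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

template<size_t blockSize, typename T>
T paddedGPUReduction(const T* A, size_t N) {
    size_t M = nextPow2(N); // assumed to be at least blockSize, per the precondition above
    T* dA;
    cudaMalloc(&dA, sizeof(T) * M);   checkCUDAError("Error allocating padded dA");
    // Zeroing the padding bytes yields a zero value for integer and IEEE
    // floating-point types, so the extra elements don't change the sum
    cudaMemset(dA, 0, sizeof(T) * M); checkCUDAError("Error zeroing padded dA");
    cudaMemcpy(dA, A, sizeof(T) * N, cudaMemcpyHostToDevice);
    checkCUDAError("Error copying A [paddedGPUReduction]");

    T tot = GPUReduction<blockSize>(dA, M);
    cudaFree(dA);
    return tot;
}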

Below is one example. Here I had a square grid of elements, where each side had $2^{N_0} + 1$ elements, so $2^{2N_0} + 2^{N_0 + 1} + 1$ in total. I split the array into a chunk of $2^{2N_0}$ elements, a chunk of $2^{N_0 + 1}$ elements (both powers of two), and a single remaining element; I run GPUReduction twice and add the last element by hand:

#include <iostream>
#include <cmath>
#include "gpu_reduce.hpp"

#define BLOCKSIZE 1024

int main()
{
    using my_t = unsigned;

    size_t N0 = 14; // Grid has a side length of 2^N0 + 1
    size_t n = std::pow(2, N0) + 1;
    size_t N = n*n;

    // Populate array
    my_t* A = (my_t*) malloc(sizeof(my_t) * N);
    for (size_t i = 0; i < N; i++)
        A[i] = 1;

    my_t* dA;
    cudaMalloc(&dA, sizeof(my_t)*N); checkCUDAError("Error allocating dA");
    cudaMemcpy(dA, A, sizeof(my_t)*N, cudaMemcpyHostToDevice); checkCUDAError("Error copying A"); 

    my_t tot = 0;

    size_t nPar1 = (n-1) * (n-1);
    size_t nPar2 = N - nPar1 - 1;

    tot  = GPUReduction<BLOCKSIZE>(dA, nPar1);
    tot += GPUReduction<BLOCKSIZE>(dA + nPar1, nPar2);
    tot += A[nPar1 + nPar2];

    std::cout << "N: " << N << std::endl;
    std::cout << "Result: " << tot << std::endl;

    cudaFree(dA);
    free(A);
    return 0;
}
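
Assuming the header is saved as gpu_reduce.hpp and the example above as main.cu (the file name is my choice), the program can be compiled and run with something like:

nvcc -O3 -o reduce main.cu
./reduce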