CUDA: efficient parallel reduction

CUDA is a very powerful API which allows us to run highly parallel software on Nvidia GPUs. It is typically used to accelerate specific operations, called kernels, such as matrix multiplication, matrix decomposition, training neural networks et cetera. One such common operation is a reduction: adding up a long array of numbers. One simple implementation