Hi xiandong,
Thanks for providing this amazing tutorial! Recently I have been working on reduce0, and I found that I can double the performance of the reduce_v0_baseline.cu kernel simply by replacing blockDim.x with THREAD_PER_BLOCK in the for loop.
before

profile result:

after

profile result:

I guess this is because the compiler can unroll the loop once its bound is a compile-time constant? It's quite interesting that such a simple change makes such a big difference.
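To make the change concrete, here is a minimal sketch of the two variants side by side. It is not the exact tutorial code: the kernel names, the value THREAD_PER_BLOCK = 256, the input size, and the host helper run_and_check are all assumptions made for illustration. The only difference between the kernels is the loop bound. With a constant bound the compiler knows every value of s at compile time, so it is free to unroll the loop and probably to turn tid % (2 * s) into a cheap bit mask, which would fit the speedup, but I have not confirmed that this is the actual cause.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREAD_PER_BLOCK 256  // assumed value; the tutorial's setting may differ

// Variant A: the loop bound is blockDim.x, which is only known at run time.
__global__ void reduce_runtime_bound(const float *d_in, float *d_out) {
    __shared__ float sdata[THREAD_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    sdata[tid] = d_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// Variant B: identical, except the loop bound is a compile-time constant.
__global__ void reduce_constant_bound(const float *d_in, float *d_out) {
    __shared__ float sdata[THREAD_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    sdata[tid] = d_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = 1; s < THREAD_PER_BLOCK; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// Hypothetical helper: runs one variant on an all-ones input and returns true
// if every per-block sum equals THREAD_PER_BLOCK (exact in float).
bool run_and_check(bool constant_bound) {
    const int num_blocks = 1024;  // assumed problem size
    const int n = num_blocks * THREAD_PER_BLOCK;
    std::vector<float> h_in(n, 1.0f), h_out(num_blocks, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, num_blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // The launch must use exactly THREAD_PER_BLOCK threads per block;
    // otherwise the two variants are no longer equivalent.
    if (constant_bound)
        reduce_constant_bound<<<num_blocks, THREAD_PER_BLOCK>>>(d_in, d_out);
    else
        reduce_runtime_bound<<<num_blocks, THREAD_PER_BLOCK>>>(d_in, d_out);
    // This blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, num_blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (float v : h_out)
        if (v != static_cast<float>(THREAD_PER_BLOCK)) return false;
    return true;
}

int main() {
    bool ok = run_and_check(false) && run_and_check(true);
    printf(ok ? "both variants correct\n" : "mismatch or CUDA error\n");
    return ok ? 0 : 1;
}
```

One way to check the unrolling guess is to build with nvcc and compare the SASS of the two kernels with cuobjdump -sass: if the guess is right, the constant-bound kernel should show straight-line code where the other has a loop. Note that the constant-bound version is only correct when the kernel is launched with exactly THREAD_PER_BLOCK threads per block.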