Performance doubles by only changing one line of code

Hi xiandong, 

Thanks for providing this amazing tutorial! I have recently been working on reduce0, and I found that I can double the performance of the `reduce_v0_baseline.cu` kernel simply by replacing `blockDim.x` with `THREAD_PER_BLOCK` in the for loop.

Before:

![Image](https://github.com/user-attachments/assets/515de7b5-f9e3-4d2e-acd8-b74ec8ec8ace)

profile result:

![Image](https://github.com/user-attachments/assets/e310710d-ff10-47ba-b033-8ea6d39abe49)

After:

![Image](https://github.com/user-attachments/assets/850801cc-0a8d-4c87-ab32-2fbd9fccb440)

profile result:

![Image](https://github.com/user-attachments/assets/99e09602-5108-420b-847f-1e56760669ff)

I guess this is because of loop unrolling? It's quite interesting that such a simple change makes such a big difference.
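
For anyone who cannot see the screenshots, here is a minimal sketch of the change. It assumes the kernel follows the usual interleaved-addressing shared-memory reduction; the kernel names, the block size of 256, and the host helper `block_sums_total` are hypothetical, and the actual code in `reduce_v0_baseline.cu` may differ. The only difference between the two kernels is the loop bound: `blockDim.x` is a runtime value, while `THREAD_PER_BLOCK` is a compile-time constant, which would at least allow the compiler to know the trip count and unroll the loop.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREAD_PER_BLOCK 256  // assumed block size; must match the launch configuration

// Variant A: the loop bound is the runtime value blockDim.x.
__global__ void reduce0_runtime_bound(const float *d_in, float *d_out) {
    __shared__ float sdata[THREAD_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = d_in[i];
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// Variant B: identical except that the loop bound is a compile-time constant,
// so the trip count is known when the kernel is compiled.
__global__ void reduce0_constant_bound(const float *d_in, float *d_out) {
    __shared__ float sdata[THREAD_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = d_in[i];
    __syncthreads();
    for (unsigned int s = 1; s < THREAD_PER_BLOCK; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) d_out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: fills the input with ones, runs one of the two
// kernels, and returns the sum of the per-block results.
float block_sums_total(bool use_constant_bound, int num_blocks) {
    const int n = num_blocks * THREAD_PER_BLOCK;
    std::vector<float> h_in(n, 1.0f), h_out(num_blocks, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, num_blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    if (use_constant_bound)
        reduce0_constant_bound<<<num_blocks, THREAD_PER_BLOCK>>>(d_in, d_out);
    else
        reduce0_runtime_bound<<<num_blocks, THREAD_PER_BLOCK>>>(d_in, d_out);
    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, num_blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    float total = 0.0f;
    for (float v : h_out) total += v;
    return total;
}
```

Both variants should return the same sum when launched with `THREAD_PER_BLOCK` threads per block, so `block_sums_total(false, 4)` and `block_sums_total(true, 4)` can be compared as a quick sanity check before profiling. Whether the compiler really unrolls variant B can be checked by comparing the generated SASS of the two kernels, for example with `cuobjdump -sass`.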
