This repository has been archived by the owner on May 13, 2022. It is now read-only.

glados::cuda::launch-kernel does not create good configuration #11

Closed
tobiashuste opened this issue Dec 14, 2016 · 7 comments

@tobiashuste
Member

The block sizes computed for 2- or 3-dimensional kernels are not multiples of 32.
For example, launching 432x500 threads in a 2-dimensional kernel results in this configuration:

  • Block size: (2,2) -> not a multiple of 32
  • Grid size: (216,250)

When invoking the kernel with a block size of (16,16) and a grid size of (27,32), the kernel runs nearly 10 times faster on average.
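To put numbers on the difference: a warp on NVIDIA GPUs is 32 threads, so a (2,2) block fills only 4 of the 32 lanes of its single warp, while a (16,16) block fills 8 warps completely. A minimal sketch of that arithmetic in plain C++ follows; the helper names are made up for illustration and are not part of GLADOS.

```cpp
// Warp size on NVIDIA GPUs. Hard-coded here as an assumption of this sketch;
// in device code the value is available as the built-in variable warpSize.
constexpr unsigned warp_size = 32;

// Hypothetical helper (not GLADOS): number of threads in a 2D block.
constexpr unsigned threads_per_block(unsigned bx, unsigned by) { return bx * by; }

// Warps needed to hold that many threads (ceiling division).
constexpr unsigned warps_per_block(unsigned threads)
{
    return (threads + warp_size - 1) / warp_size;
}

// Lanes in those warps that have no thread assigned.
constexpr unsigned idle_lanes(unsigned threads)
{
    return warps_per_block(threads) * warp_size - threads;
}

// (2,2): 4 threads, 1 warp, 28 of 32 lanes idle.
static_assert(idle_lanes(threads_per_block(2, 2)) == 28, "(2,2) wastes 28 lanes");
// (16,16): 256 threads, 8 warps, no idle lanes.
static_assert(idle_lanes(threads_per_block(16, 16)) == 0, "(16,16) fills its warps");
```

Idle lanes alone probably do not account for the whole measured gap, but they show why a block size that is a multiple of 32 is the baseline to aim for.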

@j-stephan
Contributor

Proposed new algorithm:

  1. Set a fixed blocksize (e.g. 16x16 for 2D or 16x16x2 for 3D).
  2. Calculate the gridsize based upon the blocksize and the input sizes.
  3. Launch the kernel.
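The three steps above could be sketched roughly as follows. This is plain C++ so that it compiles without nvcc: `dim2` stands in for CUDA's `dim3`, and `dim2`, `block_2d`, `div_up` and `grid_for` are hypothetical names for illustration, not the actual GLADOS implementation.

```cpp
// Plain C++ stand-in for CUDA's dim3 (assumption of this sketch).
struct dim2 { unsigned x, y; };

// Step 1: fixed block size, 16 * 16 = 256 threads = 8 full warps.
constexpr dim2 block_2d{16, 16};

// Ceiling division: number of blocks needed to cover n elements.
constexpr unsigned div_up(unsigned n, unsigned block)
{
    return (n + block - 1) / block;
}

// Step 2: grid size derived from the input extent and the block size.
constexpr dim2 grid_for(dim2 extent, dim2 block)
{
    return dim2{div_up(extent.x, block.x), div_up(extent.y, block.y)};
}

// The 432x500 example from this issue yields a (27,32) grid.
static_assert(grid_for(dim2{432, 500}, block_2d).x == 27, "432 / 16 = 27");
static_assert(grid_for(dim2{432, 500}, block_2d).y == 32, "ceil(500 / 16) = 32");

// Step 3 (not shown, needs nvcc): kernel<<<grid, block>>>(...);
// The grid covers 32 * 16 = 512 rows, so the kernel must skip y >= 500.
```

Because the grid is rounded up, the kernel has to bounds-check its indices: the last row of blocks in the example covers 12 rows that do not exist in the input.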

As the launch functions are designed to be "fire and forget" functions, I don't see major performance issues with this approach. If your kernel (not you personally, but in general) really needs the performance, you should hand-tune it anyway.

Thoughts?

@tobiashuste
Member Author

Yes, the proposed algorithm will completely suffice.
The function must definitely ensure that the blocksize is a multiple of 32; this is a prerequisite for reasonable performance in CUDA (exceptions prove the rule ;)). I think users expect the basic requirements to be met when they call this function.
Of course, any further or special tweaks need to be defined by the user, but this prerequisite needs to be fulfilled by the launch function.

@j-stephan
Contributor

Thanks for the notice. My old laptop is now able to reconstruct a 1070x1070x1033 volume 20% faster :).

j-stephan added a commit that referenced this issue Dec 14, 2016
@BieberleA
Collaborator

Perfect!

@BieberleA BieberleA reopened this Dec 14, 2016
@BieberleA
Collaborator

Could you please perform timing measurements on the K20c and the GTX 1080 with PARIS using the newly structured GLADOS?

@j-stephan
Contributor

Sure, I'll try to squeeze it in on Monday or so. Otherwise in the new year, is that sufficient?

@j-stephan
Contributor

See hzdr/PARIS#29 for further reference.
