glados::cuda::launch-kernel does not create good configuration #11

tobiashuste · 2016-12-14T08:48:55Z

The block sizes computed for 2- or 3-dimensional kernels are not multiples of 32.
For example, launching 432x500 threads in two dimensions results in this configuration:

Block size: (2,2) -> not a multiple of 32
Grid size: (216,250)

When the kernel is invoked with a block size of (16,16) and a grid size of (27,32), it runs nearly 10 times faster on average.

j-stephan · 2016-12-14T12:52:32Z

Proposed new algorithm:

Set a fixed blocksize (e.g. 16x16 for 2D or 16x16x2 for 3D).
Calculate the gridsize based upon the blocksize and the input sizes.
Launch the kernel.
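
A minimal sketch of the first two steps, written as plain host-side C++ so it compiles without the CUDA toolkit. The `Extent` struct and the names `blocks_for` and `grid_for` are hypothetical stand-ins for illustration (they are not GLADOS's actual API); in real code `Extent` would be CUDA's `dim3`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 3-component extent; stands in for CUDA's dim3.
struct Extent {
    std::size_t x, y, z;
};

// Number of blocks of size `block` needed to cover `n` elements.
// This is a ceiling division, so the last block may be only partly used.
inline std::size_t blocks_for(std::size_t n, std::size_t block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

// Derive the grid size from a fixed block size and the input size.
inline Extent grid_for(Extent input, Extent block) {
    return {blocks_for(input.x, block.x),
            blocks_for(input.y, block.y),
            blocks_for(input.z, block.z)};
}
```

For the 432x500 case from the report, `grid_for({432, 500, 1}, {16, 16, 1})` gives a grid of (27, 32, 1). Because 32 * 16 = 512 exceeds 500, the grid overshoots the input in y, so the kernel has to check its indices against the input size before touching memory.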

As the launch functions are designed to be "fire and forget" functions, I don't see major performance issues with this approach. If your kernel (not yours personally, but in general) really, really needs the performance, you should hand-tune it anyway.

Thoughts?

tobiashuste · 2016-12-14T12:59:59Z

Yes, the proposed algorithm will completely suffice.
The function must definitely ensure that the blocksize is a multiple of 32; this is a prerequisite in CUDA for reasonable performance (the exception proves the rule ;)). I think users expect the basic requirements to be met when they call this function.
Of course, any further or special tweaks need to be defined by the user, but this prerequisite needs to be fulfilled by the launch function.
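
To illustrate the multiple-of-32 requirement, here is a small host-side C++ sketch. It assumes the warp size of 32 used by NVIDIA GPUs; the function names are hypothetical and only serve the example.

```cpp
#include <cassert>
#include <cstddef>

// Warp size on NVIDIA GPUs; threads are scheduled in groups of this size.
constexpr std::size_t warp_size = 32;

// Total number of threads in one block.
inline std::size_t threads_per_block(std::size_t x, std::size_t y, std::size_t z) {
    return x * y * z;
}

// True if the block consists of whole warps only.
inline bool fills_whole_warps(std::size_t x, std::size_t y, std::size_t z) {
    return threads_per_block(x, y, z) % warp_size == 0;
}

// Fraction of lanes doing useful work in the warps a block occupies.
inline double warp_utilization(std::size_t x, std::size_t y, std::size_t z) {
    const std::size_t threads = threads_per_block(x, y, z);
    assert(threads > 0);
    // A partially filled warp still occupies a whole warp.
    const std::size_t warps = (threads + warp_size - 1) / warp_size;
    return static_cast<double>(threads) / static_cast<double>(warps * warp_size);
}
```

With the reported (2,2) block, only 4 of the 32 lanes in the warp are used, a utilization of 0.125. The proposed (16,16) and (16,16,2) blocks have 256 and 512 threads, so they fill 8 and 16 whole warps.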

j-stephan · 2016-12-14T13:58:50Z

Thanks for the notice. My old laptop is now able to reconstruct a 1070x1070x1033 volume 20% faster :).

This closes #11.

BieberleA · 2016-12-14T14:17:45Z

Perfect!

BieberleA · 2016-12-14T15:08:57Z

Could you please perform timing measurements on the K20c and GTX 1080 with PARIS using the newly structured GLADOS?

j-stephan · 2016-12-14T15:48:09Z

Sure, I'll try to squeeze it in on Monday or so. Otherwise it will have to wait until the new year. Is that sufficient?

j-stephan · 2016-12-14T15:52:18Z

See hzdr/PARIS#29 for further reference.

tobiashuste added bug enhancement labels Dec 14, 2016

tobiashuste assigned j-stephan Dec 14, 2016

j-stephan closed this as completed in 1915e56 Dec 14, 2016

j-stephan added a commit that referenced this issue Dec 14, 2016

Bug fix: Underperforming CUDA launch

f4f1318

This closes #11.

BieberleA reopened this Dec 14, 2016

j-stephan closed this as completed Dec 14, 2016
