Tuner timeout #486
Conversation
Also added a timeout flag that sets the default timeout in the options. However, the flag has no effect when the options are initialized before gflags. The flag sets the timeout in ms.

The __timestamp() function is used to read the current timestamp. According to NVIDIA this function should not be relied upon, so it could behave differently on some devices.

To implement a timeout in CUDA, the blocks first need to know the timestamp of the kernel launch. To that end, the first instruction of every block retrieves the timestamp stored in global memory. If the value is 0 (as it is at the start of the kernel), the block sets it to the current timestamp. All of this is done atomically. The stored timestamp may not be the lowest timestamp computed by any block, but it is close.

After that, timeout checks are inserted into the kernel code; they test whether the kernel has run for more than n ns. The checks are inserted after some for loops and some sequences, and they are placed such that between two checks there are at least timeout_check_frequency for-loop iterations, so that the checks do not influence the results much. timeout_check_frequency can be changed via a flag.
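The mechanism described above could be sketched roughly as follows. This is only an illustration of the idea, not this PR's actual code: the helper name, the `__device__` global (the PR uses a `cudaMalloc`'d slot passed to the kernel instead), and the early-return behavior are all assumptions.

```cuda
// Zero-initialized device-global slot holding the launch timestamp (ns).
__device__ unsigned long long kernelStartNs;

// Returns true once the kernel has been running for longer than timeoutNs.
__device__ bool __timed_out(unsigned long long timeoutNs) {
  unsigned long long now;
  asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(now));
  // The first block to arrive atomically publishes its timestamp as the
  // kernel start; later blocks read the already-stored value.
  unsigned long long start = atomicCAS(&kernelStartNs, 0ULL, now);
  if (start == 0ULL) {
    start = now; // this block won the race and set the start time
  }
  return now - start > timeoutNs;
}
```

The inserted checks would then bail out of the block, e.g. `if (__timed_out(timeoutNs)) return;`, at most once every timeout_check_frequency loop iterations.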
constexpr auto timestampFunction = R"C(
__device__ unsigned long long __timestamp() {
  unsigned long long startns;
  asm volatile("mov.u64 %0,%%globaltimer;" : "=l"(startns));
  return startns;
}
)C";
Since you mention NVIDIA explicitly states we should not use %globaltimer, why not use something based on clock64() instead (see an example here)? We can get the clock frequency from the device properties and turn cycle counts into time. This should also have significantly lower overhead than fetching data from global memory.
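For reference, the conversion the reviewer alludes to only needs the SM clock rate reported by the device properties (cudaDeviceProp::clockRate, in kHz). A minimal host-side sketch, with a helper name of our choosing:

```cpp
#include <cstdint>

// Converts an elapsed clock64() cycle count into nanoseconds.
// clockRateKHz comes from cudaDeviceProp::clockRate (reported in kHz).
inline uint64_t cyclesToNs(uint64_t cycles, uint64_t clockRateKHz) {
  // cycles / (clockRateKHz * 1e3) seconds == cycles * 1e6 / clockRateKHz ns
  return cycles * 1000000ULL / clockRateKHz;
}
```

A device-side check would then compare `clock64()` deltas against a cycle budget precomputed on the host with the inverse of this conversion.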
@@ -318,7 +323,8 @@ CudaMappingOptions CudaMappingOptions::makeUnmappedMappingOptions() {
       .useSharedMemory(false)
       .usePrivateMemory(false)
       .unrollCopyShared(false)
-      .useReadOnlyCache(false);
+      .useReadOnlyCache(false)
+      .timeout(FLAGS_timeout);
I don't see a good justification for adding debug-type information to the protos; I would kill this with fire.
-CudaRTCFunction::CudaRTCFunction() {}
+CudaRTCFunction::CudaRTCFunction() {
+  TC_CUDA_RUNTIMEAPI_ENFORCE(
+      cudaMalloc((void**)&startTimeDev, sizeof(unsigned long long)));
+}
using the clock instructions should make this unnecessary
@@ -171,6 +185,15 @@ Duration CudaRTCFunction::Launch(
   unsigned int bx = block[0];
   unsigned int by = block[1];
   unsigned int bz = block[2];
+  if (timeout != 0) {
+    unsigned long long startTime = 0;
significant, unnecessary complexity once you use clock
@@ -57,13 +57,15 @@ class CudaRTCFunction {
       std::vector<long> params,
       std::vector<void*> outputs,
       std::vector<const void*> inputs,
+      uint32_t timeout,
definitely not, we JIT compile kernels, let's hardcode the timeouts instead of increasing the complexity of the flow
The general idea is important, and if it already works, that is great. However, it comes with too much technical debt for my taste. Let's try to:
I would much rather have a global variable (as much as I hate them) than increase the complexity of the APIs. But maybe with the use of the
Please address comments, thanks!
Should be pretty easy to transform your kernel timeout into a per block timeout: we always know statically the number of SMs and the number of blocks. So something like:
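A hedged sketch of that arithmetic (the helper name and the ceiling-rounding choice are ours, not from the PR): each "wave" of resident blocks shares the device, so a whole-kernel budget can be split evenly across waves.

```cpp
#include <cstdint>

// numBlocks and smCount are both known statically at launch time.
// The kernel runs in ceil(numBlocks / smCount) waves of resident blocks,
// so each block gets roughly that fraction of the whole-kernel budget.
inline uint64_t perBlockTimeoutNs(uint64_t kernelTimeoutNs,
                                  uint64_t numBlocks,
                                  uint64_t smCount) {
  uint64_t waves = (numBlocks + smCount - 1) / smCount; // ceiling division
  return kernelTimeoutNs / waves;
}
```

This ignores occupancy (more than one block can be resident per SM), so it is a conservative bound rather than a precise one, which seems to be all that is needed here.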
We compile a new version for each set of options at each generation, so I don't see the issue, just impl a dynamic timeout in the tuner and hardcode it. I see that one could object that it would prevent memoizing the generated cuda / ptx during tuning but that is premature optimization IMO. The value of keeping a simple and clean abstraction is orders of magnitude more important.
Agreed. OTOH, I would argue that we don't need something precise, just something that works "well enough". The whole point here is to skip catastrophically bad cases that are so bad that even our pruning with the chainsaw didn't catch them (or at least not on the first iteration). Additionally, @ftynse, it seems some of the issues this PR wants to catch are on the first iteration, where there is no "best kernel" to compare to and we just end up executing really bad ones multiple times. How about putting the timeout value in the pruning function? This would guarantee we would only execute the bad function once, which already catches 90%+ of the useful cases. I still think having a way to interrupt kernels is valuable, but simplicity first, without any hesitation.
we're working on learning some :) def don't want to engineer those.
Implements a timeout for the CUDA backend using a mapping option.
Also adds a flag to change the default value of the timeout mapping option.
The aim of this PR is to allow the use of timeouts in the autotuner, where kernels taking 1 s can sometimes appear when 5 ms is achievable. For now, the timeout flag can be used to set a timeout for all kernels produced in the autotuner.
Closes #394
Tag #381