GPU task switching causes computation errors for asteroids@home when using 2 or more different models of gpu of the same type #5743
Comments
I'm pretty sure this is an issue with the project application, because for every task we assign, on start-up, the ID (0, 1, 2, etc.) of the GPU to be used.
AFAIK the client doesn't have a mechanism for pinning a job to a GPU.
The same issue is a significant problem at GPUGrid. init_data contains the correct <gpu_device_num> for a running task. But if BOINC is stopped and restarted, there is no guarantee that BOINC will assign the same GPU. If the new GPU is identical to the one used in the previous run, the task restarts normally. But if it is not identical, the task crashes, potentially losing several hours of work. The crash is initiated by the project application, but could be prevented by the BOINC client remembering and reusing the device allocation at startup. NB: consider respecting previous OpenCL device numbers too, although I've only seen the problem for CUDA apps.
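To make the "remember and reuse the device allocation" idea concrete, here is a minimal sketch in Python. It is not BOINC client code: `assign_devices`, the `previous` mapping, and the task names are all hypothetical, and the sketch assumes one task per GPU.

```python
def assign_devices(tasks, previous, devices):
    """Map each task to a device, preferring the device it used before.

    tasks:    task names in scheduling order (hypothetical)
    previous: task name -> device number from the last run (hypothetical)
    devices:  device numbers currently present
    """
    free = list(devices)
    assignment = {}
    # First pass: keep a task on its previous device if that device
    # is still present and has not been claimed yet.
    for task in tasks:
        dev = previous.get(task)
        if dev in free:
            assignment[task] = dev
            free.remove(dev)
    # Second pass: tasks without a usable previous device take
    # whatever is left, in order. Tasks that get nothing must wait.
    for task in tasks:
        if task not in assignment and free:
            assignment[task] = free.pop(0)
    return assignment
```

With `previous = {"a": 1, "b": 0}` and both devices present, task "a" goes back to device 1 and "b" to device 0, regardless of the order the tasks are listed in. If a device has been removed, the affected task is simply left unassigned here; a real client would need a policy for that case.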
The issue is whether the task crashes because it runs on a different GPU than where it started. The former seems odd - why would a checkpoint file be specific to a GPU instance?
I'm looking through my recent errors for an example of the specific failure case, but I haven't found one yet. From memory, the problem comes from the 'just in time' GPU code compiler. At GPUGrid, this produces code which is specific to the individual GPU type used in the first run. If the second GPU is different, the by now pre-compiled code is incompatible with the hardware.
Can't find an error on my own machines - I know from bitter experience that I have to avoid shutdowns when GPUGrid work is running. But see https://www.gpugrid.net/forum_thread.php?id=5461 for a report/response on their message board.
@davidpanderson, as @RichardHaselgrove already mentioned, it's very important that a task that started running on a particular GPU sticks to it until it finishes; otherwise there is no guarantee that the computation can be continued, even from the checkpoint.
Yes, it appears that some GPU apps generate code for the specific hardware used. Here is the error output from a failed asteroids@home task: <stderr_txt> Error creating queue: build program failure(-11) </stderr_txt>
The <gpu_device_num> values appeared the same after resuming computation. In BOINC Manager, however, the task that said "device 0" likely said "device 1" before the error. The error happens immediately after resuming, so it is hard to tell, although I have seen this swap occur with other applications from other projects. The following error from the above post would indicate that the tasks are sometimes swapping GPUs: Error: The program ISA amdgcn-amd-amdhsa--gfx1032 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx1102 (gfx1032 is the RX 6600).
One option would be for the app to compile its kernels each time it starts.
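A possible middle ground between compiling on every start and reusing a stale binary would be to key any cached kernel binary on the device it was built for, so that a different GPU triggers a rebuild instead of a crash. A rough sketch in Python, under the assumption that the app can query an ISA string for its device; `cache_key`, `load_or_build`, and the `build` callable are hypothetical stand-ins, not part of BOINC or the asteroids@home app:

```python
import hashlib


def cache_key(kernel_source: str, device_isa: str) -> str:
    """Derive a cache key from both the kernel source and the device ISA."""
    h = hashlib.sha256()
    h.update(device_isa.encode("utf-8"))
    h.update(b"\0")  # separator so the two fields cannot run together
    h.update(kernel_source.encode("utf-8"))
    return h.hexdigest()


def load_or_build(cache, kernel_source, device_isa, build):
    """Return a cached binary for this device, building it if missing.

    `cache` is any dict-like store; `build` is a hypothetical callable
    standing in for the real kernel compile step.
    """
    key = cache_key(kernel_source, device_isa)
    if key not in cache:
        cache[key] = build(kernel_source, device_isa)
    return cache[key]
```

A task that resumes on a gfx1102 device after starting on gfx1032 would miss the cache and recompile, while a resume on the same device type would reuse the earlier build.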
If we pin each GPU job to a GPU instance, the following could happen: We now have 2 jobs pinned to GPU 0; GPU 1 is idle. To avoid this, we'd have to extend the simulation done by the work fetch logic.
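The end state described above (two jobs pinned to GPU 0, GPU 1 idle) can be illustrated with a small sketch. This is not the client's scheduler or its work fetch simulation; `schedule_pinned` and the job names are hypothetical, and the sketch assumes one running job per GPU.

```python
def schedule_pinned(pins, devices):
    """Run at most one job per GPU, honoring each job's pinned device.

    pins:    job name -> pinned device number (hypothetical)
    devices: device numbers currently present
    Returns (running, waiting, idle).
    """
    running = {}   # device -> job currently running on it
    waiting = []   # jobs whose pinned device is busy or missing
    for job, dev in pins.items():
        if dev in devices and dev not in running:
            running[dev] = job
        else:
            waiting.append(job)
    # Devices with no pinned job cannot help the waiting jobs.
    idle = [d for d in devices if d not in running]
    return running, waiting, idle
```

Calling `schedule_pinned({"A": 0, "B": 0}, [0, 1])` leaves job "B" waiting while GPU 1 sits idle, which is the situation strict pinning would have to be designed around.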
Describe the bug
If GPU computation is suspended during use or when an exclusive application is running, BOINC sometimes swaps which task is on which GPU when computation resumes. This causes a computation error for asteroids@home tasks when using multiple AMD GPUs of different models, for example, an RX 7600 XT and an RX 6600. This might be an application-specific issue, but it might be a good idea to have an option to not switch tasks between GPUs if possible, unless, for example, one GPU is removed, in which case all the tasks would have to run on the remaining GPU.
Steps To Reproduce
Expected behavior
I would expect the task to stay on the GPU it started on if that is necessary for the task to finish. An option to disable GPU task switching is a potential solution, or tasks could specify whether or not they can be switched.
Screenshots
System Information
Additional context