Better support suspend/resume with AMDGPU on bare metal server #181

Open
wants to merge 2 commits into
base: master


@jiangliu jiangliu commented Jan 3, 2025

When testing suspend/resume with an AMDGPU device on bare metal servers, the device fails to resume on the third attempt. Fix the issue by resetting the ASIC when needed.

When GPU suspend is aborted, reset the soc15 ASIC on dGPUs in the same
way as on APUs. Otherwise resume may fail with the following errors:
[  547.229463] amdgpu 0001:81:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)

[  555.126827] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[  555.126901] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[  555.126957] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_4_3> failed -110
[  555.126959] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
[  555.126965] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110
[  555.126966] PM: Device 0000:0a:00.0 failed to resume async: error -110

Signed-off-by: Jiang Liu <[email protected]>
Tested-by: Shuo Liu <[email protected]>
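
The decision described in the commit message above can be sketched as a small, self-contained model. This is only an illustration of the stated idea (treat a dGPU like an APU when a suspend was aborted), not the driver's actual code: `struct gpu_state`, its fields, and both function names are hypothetical, and how the "suspend completed" status gets recorded is assumed rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device state; NOT the real struct amdgpu_device. */
struct gpu_state {
    bool is_apu;           /* true for an APU, false for a dGPU */
    bool in_s3;            /* device is going through a suspend-to-RAM cycle */
    bool suspend_complete; /* assumed flag: false when the suspend was aborted */
};

/* Behaviour the description implies before the change: only APUs are reset. */
static bool need_reset_apu_only(const struct gpu_state *dev)
{
    assert(dev != NULL);
    return dev->is_apu && dev->in_s3 && !dev->suspend_complete;
}

/* Behaviour the description proposes: a dGPU is handled the same as an APU. */
static bool need_reset_on_resume(const struct gpu_state *dev)
{
    assert(dev != NULL);
    return dev->in_s3 && !dev->suspend_complete;
}
```

In this model, a dGPU whose suspend was aborted is skipped by `need_reset_apu_only()` but selected for an ASIC reset by `need_reset_on_resume()`, while a device whose suspend completed is never reset.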
commit 9cef84b
drm/amdgpu: update suspend status for aborting from deeper suspend

There are other suspend abort cases that can call the noirq
suspend, apart from executing the _S3 method. Those cases need to
be handled as an incomplete suspend.

Signed-off-by: Jiang Liu <[email protected]>
@superm1
Contributor

superm1 commented Jan 9, 2025

Can you please bring these to the amd-gfx M/L? Kernel patches are reviewed there.

@jiangliu
Author

> Can you please bring these to the amd-gfx M/L? Kernel patches are reviewed there.

Sure, will do that.
This patchset only applies to this repo and conflicts with amd-staging-drm-next. Would it still work to send it to the amd-gfx mailing list?

@superm1
Contributor

superm1 commented Jan 10, 2025

Can you please rebase and adjust conflicts on AMD staging drm next?

This is the way all new changes start. We can do a backport to the dkms and other branches after it's landed.

@jiangliu
Author

> Can you please rebase and adjust conflicts on AMD staging drm next?
>
> This is the way all new changes start. We can do a backport to the dkms and other branches after it's landed.

The code logic differs between these two repos, and this change only applies to this repo. The amd-staging-drm-next repo has a different code base here, so I can't rebase onto it :(

@superm1
Contributor

superm1 commented Jan 10, 2025

Yes, the logic changed in newer kernels. I believe the specific commit in question is torvalds/linux@d5e3d8a.

If you port that to this branch, does it work properly?
