Change to new graphic test strategy (BugFix) #586

Merged
merged 15 commits into from
Mar 18, 2024

Conversation


@hanhsuan hanhsuan commented Jun 28, 2023

Description

Change the GPU tests to a prime/reverse prime offload strategy so that the test jobs match real usage, and remove redundant jobs.

Resolved issues

Tracking in this Jira card
Github issue #491

Submissions

I + I

I + A

A + N (this DUT hits an NVIDIA driver error during suspend/resume)

A only

@hanhsuan hanhsuan requested review from pieqq and yphus June 28, 2023 05:48
pieqq previously requested changes Jun 29, 2023
Collaborator

@pieqq pieqq left a comment

Thank you for this big update! You've done great work, first with the research and documentation on the prime and reverse prime process, then with this implementation!

I made some suggestions inline, please check them and let me know if you have any questions.

I've seen that you've created new test plans (for instance graphics-gpu-cert-automated), but it's not immediately clear why. My understanding is that they are only to be used with 22.04+ test plans, in order to keep backward compatibility with previous OSes and jobs? It would be good to have some kind of explanation somewhere (I like to put this kind of info in the git commit message, because there is a higher chance of finding it again when doing code archeology later on).

@hanhsuan hanhsuan force-pushed the change_to_new_graphic_test_strategy branch from 141c780 to aaf9eda Compare June 30, 2023 01:07
@hanhsuan
Contributor Author

@pieqq I've fixed the code and description as you suggested. If there is something I missed, please let me know.

Contributor

@diohe0311 diohe0311 left a comment

Hi @hanhsuan, thanks for the clear explanation!
I'll run the tests on DUTs based on our discussion.

@diohe0311
Contributor

diohe0311 commented Oct 26, 2023

I couldn't find the test jobs below when running graphics-gpu-cert-automated, and I also failed to run them directly. Am I missing something? Could you please provide your test steps?

  • graphics/1_glxgears_auto_.*
  • graphics/2_glxgears_auto_.*

steps to reproduce:

  1. sideload base provider in Vision-PV-SKU16_202203-30095, IOKE-PV-SKU6_202304-31461
  2. run sudo lshw -C display (check gpu combination)
  3. run checkbox-cli
  4. select graphics-gpu-cert-automated
  5. can't find the jobs under Graphics tests
    • graphics/1_driver_version_PCI_ID_0x4688
    • graphics/1_gl_support_PCI_ID_0x4688
    • graphics/1_minimum_resolution_PCI_ID_0x4688
    • graphics/VESA_drivers_not_in_use
  6. run checkbox-cli run com.canonical.certification::graphics/glxgears, and get Error: couldn't open display (null)
  7. run ./prime_offload_tester.py -c glxgears -p 0000:01:00.0 -d nvidia -t 30, and get Error: couldn't open display (null)

The test steps are fine; they just can't be run via SSH, because the DUT will try to ask the host to run the graphics test and fail due to a lack of permission (and even if X11 forwarding is enabled with ssh -X to allow it, the test will use the GPU in the host, so that still doesn't work).
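The "couldn't open display (null)" error in steps 6 and 7 is what an X client such as glxgears prints when no display is reachable, which is the usual situation in a plain SSH session. As a minimal sketch, a wrapper could check for this before launching a graphical command. The helper name `can_run_graphics_locally` is hypothetical and not part of prime_offload_tester.py, and the `localhost:` prefix check assumes the typical DISPLAY value that SSH X11 forwarding sets.

```python
import os


def can_run_graphics_locally(env=None):
    """Return (ok, reason) for launching an X client from this shell.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    display = env.get("DISPLAY", "")
    if not display:
        # glxgears would fail with "couldn't open display (null)".
        return False, "DISPLAY is not set (plain SSH session?)"
    if env.get("SSH_CONNECTION") and display.startswith("localhost:"):
        # Assumption: a forwarded display usually looks like localhost:10.0,
        # in which case rendering would happen on the SSH client, not the DUT.
        return False, "DISPLAY is forwarded over SSH"
    return True, "local display available"
```

Running the test from a terminal inside the DUT's own desktop session avoids both failure cases.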

@diohe0311
Contributor

Hey @hanhsuan, as discussed, we ran ODM Client Certification for Desktop 22.04 - (1/2) Manual tests and found automated test jobs under it, could you please check on that?

@hanhsuan hanhsuan force-pushed the change_to_new_graphic_test_strategy branch from 67c4a91 to b129c98 Compare October 30, 2023 03:46
@codecov

codecov bot commented Oct 30, 2023

Codecov Report

Attention: Patch coverage is 96.66667%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 39.00%. Comparing base (c3c3f45) to head (2cf82cb).
Report is 180 commits behind head on main.

Files                                              Patch %   Lines
providers/base/bin/prime_offload_tester.py         97.43%    2 Missing and 1 partial ⚠️
providers/resource/bin/graphics_card_resource.py   66.66%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #586      +/-   ##
==========================================
+ Coverage   34.83%   39.00%   +4.16%     
==========================================
  Files         302      307       +5     
  Lines       34165    35240    +1075     
  Branches     5909     6058     +149     
==========================================
+ Hits        11903    13744    +1841     
+ Misses      21697    20890     -807     
- Partials      565      606      +41     
Flag                Coverage Δ
provider-base       14.52% <97.43%> (+11.39%) ⬆️
provider-resource   22.85% <66.66%> (+10.84%) ⬆️


@hanhsuan
Contributor Author

Hey @hanhsuan, as discussed, we ran ODM Client Certification for Desktop 22.04 - (1/2) Manual tests and found automated test jobs under it, could you please check on that?

@diohe0311 Thanks for your help in pointing out some issues in my PR. I have modified it so that ODM Client Certification for Desktop 22.04 - (1/2) Manual tests includes the correct test plan.

@kissiel
Contributor

kissiel commented Oct 30, 2023

/canonical/self-hosted-runners/run-workflows 28e1ba4

… depending on index

For NVIDIA GPUs, prime/reverse prime offload is not supported before
driver version 435.17. Therefore, this new strategy is only for 22.04+.
For backward compatibility, this PR adds new test plans for 22.04+ as
follows:
    graphics-gpu-cert-full
        graphics-gpu-cert-automated
        graphics-gpu-cert-manual
    after-suspend-graphics-gpu-cert-full
        after-suspend-graphics-gpu-cert-automated
        after-suspend-graphics-gpu-cert-manual
    monitor-gpu-cert-full
        monitor-gpu-cert-automated
        monitor-gpu-cert-manual
    after-suspend-monitor-gpu-cert-full
        after-suspend-monitor-gpu-cert-automated
        after-suspend-monitor-gpu-cert-manual

It also adds a new Python script, "prime_offload_tester.py", to execute a
command with prime/reverse prime settings for the new test jobs, as follows:
    Auto test:
        graphics/{index}_auto_glxgears_{product_slug}
        graphics/{index}_auto_glxgears_fullscreen_{product_slug}
    Manual:
        graphics/{index}_valid_glxgears_{product_slug}
        graphics/{index}_valid_glxgears_fullscreen_{product_slug}
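As an illustration of what "executing a command with prime/reverse prime settings" can look like, here is a minimal sketch that builds the environment for a render-offload run. `__NV_PRIME_RENDER_OFFLOAD` and `__GLX_VENDOR_LIBRARY_NAME` are the variables the NVIDIA driver documents for PRIME render offload (available from 435.17), and `DRI_PRIME` is the Mesa GPU selector. The function name `offload_env` is hypothetical, and whether prime_offload_tester.py sets exactly these variables is an assumption, not something confirmed in this thread.

```python
import os


def offload_env(driver, pci_bdf, base_env=None):
    """Return a copy of the environment with GPU offload variables set.

    Hypothetical helper; driver and pci_bdf mirror the -d and -p options
    used in the thread (e.g. -d nvidia -p 0000:01:00.0).
    """
    env = dict(os.environ if base_env is None else base_env)
    if driver == "nvidia":
        # NVIDIA PRIME render offload (driver 435.17 or later).
        env["__NV_PRIME_RENDER_OFFLOAD"] = "1"
        env["__GLX_VENDOR_LIBRARY_NAME"] = "nvidia"
    else:
        # Mesa drivers: select the GPU by PCI address,
        # e.g. 0000:01:00.0 becomes pci-0000_01_00_0.
        env["DRI_PRIME"] = "pci-" + pci_bdf.replace(":", "_").replace(".", "_")
    return env
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.Popen(["glxgears"], env=...)` from within a local desktop session.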
@hanhsuan hanhsuan force-pushed the change_to_new_graphic_test_strategy branch from 28e1ba4 to 79599f8 Compare October 31, 2023 00:09
@pieqq
Collaborator

pieqq commented Oct 31, 2023

/canonical/self-hosted-runners/run-workflows 79599f8

@hanhsuan hanhsuan changed the title Change to new graphic test strategy Change to new graphic test strategy (BugFix) Oct 31, 2023
@diohe0311
Contributor

diohe0311 commented Nov 23, 2023

/canonical/self-hosted-runners/run-workflows 93c5634

diohe0311 previously approved these changes Nov 23, 2023
Contributor

@diohe0311 diohe0311 left a comment

LGTM+1

Contributor

@kissiel kissiel left a comment

I've hit the 20-comment mark.
Overall comments are:

  • I think the logic for the case where the card is not found is wrong (and the test also reflects the wrong logic)
  • the error handling is unnecessarily complicated, and is not Pythonic

The branch could easily have been split into smaller chunks. For instance, the Checkbox job definitions don't have to be here. Let's start with the testing program.

@hanhsuan
Contributor Author

  • I think the logic for the case where the card is not found is wrong (and the test also reflects the wrong logic)

I couldn't figure out where the logic error is. Could you explain more about this?

2. Add an extra method to avoid check failures caused by the 6.5 kernel bug
@hanhsuan
Contributor Author

The submissions for the new commit:
Intel only
Intel + Nvidia
Intel + Nvidia remote test

@hanhsuan
Contributor Author

The bug of 6.5 kernel is a public bug now.

Contributor

@kissiel kissiel left a comment

  1. There are time.sleep()s being run when executing unit tests from this PR (it takes 2 minutes to run unit tests)
  2. The code is unnecessarily complicated - see below.
  3. There are unnecessary double exception contexts.
  4. The if __name__ == "__main__": is abused, should only call main
  5. The PR is too big. Please split it into smaller chunks (like one function with unit tests that does one thing).
  6. The tests provided here are not unit tests. They also suffer from having to monkeypatch a lot of things in the god object because of the suboptimal problem decomposition.
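Point 1 can be illustrated with a small sketch: a function that waits between retries, and a unit test that patches `time.sleep` so the test finishes instantly instead of actually waiting. The names `wait_for` and `WaitForTests` are hypothetical and are not taken from this PR.

```python
import time
import unittest
from unittest.mock import patch


def wait_for(check, attempts=20, interval=1):
    """Call check() up to `attempts` times, sleeping between failures."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(interval)
    return False


class WaitForTests(unittest.TestCase):
    @patch("time.sleep")
    def test_gives_up_without_really_sleeping(self, mock_sleep):
        # The patched sleep returns immediately, so three "seconds" of
        # waiting cost no wall-clock time in the test run.
        self.assertFalse(wait_for(lambda: False, attempts=3))
        self.assertEqual(mock_sleep.call_count, 3)

    @patch("time.sleep")
    def test_returns_as_soon_as_check_passes(self, mock_sleep):
        self.assertTrue(wait_for(lambda: True))
        mock_sleep.assert_not_called()
```

For point 4, a script built this way would keep its logic in a `main()` function and end with nothing more than `raise SystemExit(main())` under the `if __name__ == "__main__":` guard, so that tests can import the module without side effects.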

Overall recommendation: try isolating one problem at a time, write a function that solves that one problem together with tests for that function, and file a PR.
Then another one, another one, and so on. After 4-5 PRs, you'll be able to compose a small change to a Checkbox job that uses all those functions.

This cannot land in this state.

hanhsuan added a commit to hanhsuan/checkbox that referenced this pull request Jan 15, 2024
This script provides the functionality to run a process on a specific GPU and validate that it runs there.

There is a bug in kernels from 6.3 to 6.5.0.14.
https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.5/+bug/2047461
Please don't use this script on those kernel versions.
2. The fix for the 6.5 kernel bug has been released in proposed kernel
   6.5.0.16 and has been tested. Therefore, the workaround is removed.
@hanhsuan
Contributor Author

  1. The patch for the 6.5 kernel has been released to proposed (6.5.0.16), so the workaround is no longer needed.

submission
submission of this script without workaround that runs on 6.5.0.16

  2. The job and test plan changes have been moved to another PR, which will modify only the 24.04 test plan to reduce the impact on SRU.

@hanhsuan hanhsuan requested a review from kissiel January 16, 2024 07:42
hanhsuan added a commit to hanhsuan/checkbox that referenced this pull request Jan 17, 2024
…st plan

only.

1. New jobs that use prime_offload_tester.py to validate GPU rendering.
2. New test plans that combine integrated and discrete GPUs into one.
3. Remove unnecessary jobs and test plans after switching to new graphic test strategy.
Contributor

@kissiel kissiel left a comment

The function decomposition proposed here introduces unnecessary complexity, and makes both the logic and the tests less readable and harder to maintain.

There are some problems around

I'm providing comments and suggestions inline.

hanhsuan and others added 4 commits March 1, 2024 10:04
2. fix docstring error
3. change default to 20s and the logic in the check_offload
4. change RuntimeError to SystemExit
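Items 3 and 4 of the commit message above (a 20-second default and `SystemExit` instead of `RuntimeError`) can be sketched as follows. Raising `SystemExit` with a string makes Python print the message to stderr and exit with status 1, which suits a Checkbox job that is judged by its exit code. The `probe` callable and the error text are hypothetical placeholders; the real `check_offload` in prime_offload_tester.py may be structured differently.

```python
import time


def check_offload(probe, timeout=20, interval=1):
    """Poll probe() for up to `timeout` seconds; exit if it never succeeds.

    probe is a hypothetical callable returning True once the command is
    seen running on the expected GPU.
    """
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return
        if time.monotonic() >= deadline:
            # Illustrative message; not the script's actual wording.
            raise SystemExit("Command was not found running on the GPU")
        time.sleep(interval)
```

Because `SystemExit` is not a subclass of `Exception`, a broad `except Exception` elsewhere in the script will not swallow the failure by accident.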
@hanhsuan hanhsuan requested a review from kissiel March 5, 2024 05:25
@hanhsuan
Contributor Author

hanhsuan commented Mar 5, 2024

@kissiel I have fixed the code. Please review it when you have time. Thanks.

Contributor

@kissiel kissiel left a comment

There are a few tweaks that could be applied to this PR, but it's not critical, so we can land this.

Thank you for the very extensive work on this!

@kissiel kissiel dismissed pieqq’s stale review March 18, 2024 16:36

The pxus got removed, so Pierre's request is no longer valid.

@kissiel kissiel merged commit d8063c2 into canonical:main Mar 18, 2024
14 checks passed
binli pushed a commit to binli/checkbox that referenced this pull request Mar 22, 2024
* Changing gpu test strategy to prime/reverse-prime gpu offload without depending on index

For NVIDIA GPUs, prime/reverse prime offload is not supported before
driver version 435.17. Therefore, this new strategy is only for 22.04+.
For backward compatibility, this PR adds new test plans for 22.04+ as
follows:
    graphics-gpu-cert-full
        graphics-gpu-cert-automated
        graphics-gpu-cert-manual
    after-suspend-graphics-gpu-cert-full
        after-suspend-graphics-gpu-cert-automated
        after-suspend-graphics-gpu-cert-manual
    monitor-gpu-cert-full
        monitor-gpu-cert-automated
        monitor-gpu-cert-manual
    after-suspend-monitor-gpu-cert-full
        after-suspend-monitor-gpu-cert-automated
        after-suspend-monitor-gpu-cert-manual

It also adds a new Python script, "prime_offload_tester.py", to execute a
command with prime/reverse prime settings for the new test jobs, as follows:
    Auto test:
        graphics/{index}_auto_glxgears_{product_slug}
        graphics/{index}_auto_glxgears_fullscreen_{product_slug}
    Manual:
        graphics/{index}_valid_glxgears_{product_slug}
        graphics/{index}_valid_glxgears_fullscreen_{product_slug}

* Add more unit tests for graphics_card_resource.py and prime_offload_tester.py

* Add one more unit test

* Move argument parsing into a single function for unit testing

* Fix flake8 error

* 1. Refactor to be more Pythonic
2. Add an extra method to avoid check failures caused by the 6.5 kernel bug

* Fix flake8 error

* Add executable permission

* 1. Move changes of job and test-plan to another PR
2. The fix for the 6.5 kernel bug has been released in proposed kernel
   6.5.0.16 and has been tested. Therefore, the workaround is removed.

* 1. Move change of jobs and test plan to another PR
2. Add more unit tests

* Fix pci BDF format check error

ref:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/pci/early.c?id=refs/tags/v3.12.7#n65
https://wiki.xenproject.org/wiki/Bus:Device.Function_(BDF)_Notation
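For reference, the BDF notation described in the links above is `domain:bus:device.function`, for example the `0000:01:00.0` passed with `-p` earlier in this thread. A minimal sketch of a format check follows. The regular expression and the name `is_valid_bdf` are illustrative, not the script's actual code, and requiring the four-digit domain is an assumption based on that example.

```python
import re

# domain: 4 hex digits, bus: 2 hex digits,
# device: 00-1f (5 bits), function: 0-7 (3 bits)
BDF_RE = re.compile(
    r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[01][0-9a-fA-F]\.[0-7]"
)


def is_valid_bdf(text):
    """Return True if text is a full domain:bus:device.function address."""
    return BDF_RE.fullmatch(text) is not None
```

Using `fullmatch` rather than `match` rejects trailing characters such as `0000:01:00.0x`.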

* Update providers/base/bin/prime_offload_tester.py

Co-authored-by: kissiel <[email protected]>

* Update providers/base/bin/prime_offload_tester.py

Co-authored-by: kissiel <[email protected]>

* Update providers/base/bin/prime_offload_tester.py

Co-authored-by: kissiel <[email protected]>

* 1. move the get clients from check_offload to get_client
2. fix docstring error
3. change default to 20s and the logic in the check_offload
4. change RuntimeError to SystemExit

---------

Co-authored-by: kissiel <[email protected]>
pieqq added a commit that referenced this pull request Apr 10, 2024
* This is part of #586 that includes the changes of job and test plan
only.

1. New jobs that use prime_offload_tester.py to validate GPU rendering.
2. New test plans that combine integrated and discrete GPUs into one.
3. Remove unnecessary jobs and test plans after switching to new graphic test strategy.

* Separate the test cases of laptop and desktop
1. Test cases related to prime offload are used for laptops and All-in-Ones (see below)
2. Simply test the default renderer for desktops

* Add a description to let users know the different test targets of prime
offload.

The graphics configuration of AIO devices is similar to that of laptops:
the iGPU is connected to the integrated monitor, and some AIO configurations
also come with a dGPU. To cover this case, AIOs are added to the
prime offload test group.

---------

Co-authored-by: Pierre Equoy <[email protected]>
LiaoU3 pushed a commit to LiaoU3/checkbox that referenced this pull request Apr 17, 2024
…al#942)

* This is part of canonical#586 that includes the changes of job and test plan
only.

1. New jobs that use prime_offload_tester.py to validate GPU rendering.
2. New test plans that combine integrated and discrete GPUs into one.
3. Remove unnecessary jobs and test plans after switching to new graphic test strategy.

* Separate the test cases of laptop and desktop
1. Test cases related to prime offload are used for laptops and All-in-Ones (see below)
2. Simply test the default renderer for desktops

* Add a description to let users know the different test targets of prime
offload.

The graphics configuration of AIO devices is similar to that of laptops:
the iGPU is connected to the integrated monitor, and some AIO configurations
also come with a dGPU. To cover this case, AIOs are added to the
prime offload test group.

---------

Co-authored-by: Pierre Equoy <[email protected]>