Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16557 test: Add debug to NvmeEnospace ftest #15559

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

knard38
Copy link
Contributor

@knard38 knard38 commented Dec 4, 2024

Issue Description

The functional test nvme/enospace.py is sporadically failing due to unexpected ENOSPACE errors, as outlined in the DAOS-16557 ticket.

Current Challenge

The INFO log level used in the functional test is not sufficiently verbose to determine why aggregation is not triggered when the test fails.

Potential workarounds have limitations:

  • Increasing log verbosity to DEBUG: Generates excessive logs, making it unsuitable for CI environments.
  • Running tests locally with higher log verbosity: Not feasible since the issue cannot be reproduced locally at this time.

Proposed Solution

This PR introduces telemetry logs of the rebuild state at each major step of the test to gain better insights into why aggregation occasionally fails. These telemetry logs aim to provide enough diagnostic information without overwhelming the CI with excessive log data.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Copy link

github-actions bot commented Dec 4, 2024

Ticket title is 'nvme/enospace.py:NvmeEnospace.test_enospace_lazy_with_fg - dfs_write write failed No space left on device'
Status is 'In Progress'
Labels: '2.6.1rc1,ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-16557

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15559/1/execution/node/1342/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15559/3/testReport/

@knard38 knard38 force-pushed the ckochhof/fix/master/daos-16557-v2 branch 2 times, most recently from aabbb62 to 75c7592 Compare January 31, 2025 06:04
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15559/5/testReport/

Add aggregation debugging information on the state of the pool to allow
debugging if ENOSPACE error happens unexpectedly.

Quick-Functional: true
Test-tag: NvmeEnospace
Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@knard38 knard38 force-pushed the ckochhof/fix/master/daos-16557-v2 branch from 75c7592 to 7d40066 Compare January 31, 2025 07:46
@knard38 knard38 marked this pull request as ready for review January 31, 2025 15:14
@knard38 knard38 requested review from a team as code owners January 31, 2025 15:14
daltonbohning
daltonbohning previously approved these changes Jan 31, 2025
Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright GHA is complaining since you are committing with intel instead of hpe email

src/tests/ftest/nvme/enospace.py Outdated Show resolved Hide resolved
src/tests/ftest/nvme/enospace.py Outdated Show resolved Hide resolved
src/tests/ftest/nvme/enospace.py Outdated Show resolved Hide resolved
src/tests/ftest/nvme/enospace.py Outdated Show resolved Hide resolved
ravalsam
ravalsam previously approved these changes Jan 31, 2025
Copy link
Contributor

@ravalsam ravalsam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. But lot of stdout for those metrics output. Not sure if we have a way to not to print? Table format is looking good and may be enough in the log. But not a major issue..

@knard38 knard38 self-assigned this Jan 31, 2025
@knard38
Copy link
Contributor Author

knard38 commented Jan 31, 2025

Looks good. But lot of stdout for those metrics output. Not sure if we have a way to not to print? Table format is looking good and may be enough in the log. But not a major issue..

TBH, at this point of my investigation on this aggregation issue I am not sure of which aggregation metrics will be really relevant.
Thus, I decided to put as much as I could for being able to investigate this issue, without overflowing the CI with logs and also preserving the test behaviour.
Indeed, the big dilemma with this issue is that it should probably needs to have the DEBUG log to be able to properly investigate on it (at the cost of overflowing the CI with logs).
Moreover, I am pretty sure that the issue will probably never happen with so much logs as it will have an influence on the behaviour of the aggregation process :'-(

@ravalsam
Copy link
Contributor

Looks good. But lot of stdout for those metrics output. Not sure if we have a way to not to print? Table format is looking good and may be enough in the log. But not a major issue..

TBH, at this point of my investigation on this aggregation issue I am not sure of which aggregation metrics will be really relevant. Thus, I decided to put as much as I could for being able to investigate this issue, without overflowing the CI with logs and also preserving the test behaviour. Indeed, the big dilemma with this issue is that it should probably needs to have the DEBUG log to be able to properly investigate on it (at the cost of overflowing the CI with logs). Moreover, I am pretty sure that the issue will probably never happen with so much logs as it will have an influence on the behaviour of the aggregation process :'-(

yes, this is tricky one.

One more option is that, we can set the DEBUG mode on live server using dmg server set-logmasks when test fails. May be those logs may give something?

Integrate reviewers comments:
- Fix of newlines
- Pass pool attributes as function argument
- Compute the columns size into the display_table() method

Quick-Functional: true
Test-tag: NvmeEnospace
Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>
@knard38 knard38 dismissed stale reviews from ravalsam and daltonbohning via 6445498 January 31, 2025 21:19
@knard38
Copy link
Contributor Author

knard38 commented Jan 31, 2025

Looks good. But lot of stdout for those metrics output. Not sure if we have a way to not to print? Table format is looking good and may be enough in the log. But not a major issue..

TBH, at this point of my investigation on this aggregation issue I am not sure of which aggregation metrics will be really relevant. Thus, I decided to put as much as I could for being able to investigate this issue, without overflowing the CI with logs and also preserving the test behaviour. Indeed, the big dilemma with this issue is that it should probably needs to have the DEBUG log to be able to properly investigate on it (at the cost of overflowing the CI with logs). Moreover, I am pretty sure that the issue will probably never happen with so much logs as it will have an influence on the behaviour of the aggregation process :'-(

yes, this is tricky one.

One more option is that, we can set the DEBUG mode on live server using dmg server set-logmasks when test fails. May be those logs may give something?

Not sure to understand as setting DEBUG mode when the test has failed will be too late.
From my understanding the relevant DEBUG logs should be before the test fails: for example some logs indicating why the aggregation is not triggered before the unexpected overflow.

daltonbohning
daltonbohning previously approved these changes Jan 31, 2025
src/tests/ftest/nvme/enospace.py Outdated Show resolved Hide resolved
src/tests/ftest/nvme/enospace.py Outdated Show resolved Hide resolved
src/tests/ftest/nvme/enospace.py Outdated Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15559/7/execution/node/770/log

shimizukko
shimizukko previously approved these changes Feb 1, 2025
"engine_pool_vos_aggregation_epr_duration_max",
"engine_pool_vos_aggregation_epr_duration_mean",
"engine_pool_vos_aggregation_epr_duration_min",
# "engine_pool_vos_aggregation_epr_duration_samples",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need these commented out names?

Copy link
Contributor Author

@knard38 knard38 Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept them as a reminder if it will be needed to have more stats to debug the aggregation issues (if it appears again). However, I am pretty sure that they will not really bring more relevant infos to solve this aggregation issue.
Thus, I will remove them.

  • Remove useless commented metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept them as a reminder if it will be needed to have more stats to debug the aggregation issues (if it appears again). However, I am pretty sure that they will not really bring more relevant infos to solve this aggregation issue. Thus, I will remove them.

  • Remove useless commented metrics

Fixed with commit 115170b

Integrate reviewers comments:
- Update param type
- Remove useless commented metrics

Quick-Functional: true
Test-tag: NvmeEnospace
Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>
…/daos-16557-v2

Quick-Functional: true
Test-tag: NvmeEnospace
Required-githooks: true
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants