DAOS-16557 test: Add debug to NvmeEnospace ftest #15559
base: master
Ticket title is 'nvme/enospace.py:NvmeEnospace.test_enospace_lazy_with_fg - dfs_write write failed No space left on device'
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15559/1/execution/node/1342/log
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15559/3/testReport/
aabbb62 to 75c7592
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15559/5/testReport/
Add aggregation debugging information on the state of the pool to allow debugging if ENOSPACE error happens unexpectedly.

Quick-Functional: true
Test-tag: NvmeEnospace
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
75c7592 to 7d40066
The copyright GHA is complaining because you are committing with an Intel email instead of an HPE one.
Looks good, but the metrics output adds a lot to stdout. Is there a way to avoid printing it? The table format looks good and may be enough in the log. Not a major issue, though.
TBH, at this point of my investigation into this aggregation issue, I am not sure which aggregation metrics will really be relevant.
Yes, this is a tricky one. One more option is that we can set the DEBUG mode on the live server using
Integrate reviewers comments:
- Fix of newlines
- Pass pool attributes as function argument
- Compute the columns size into the display_table() method

Quick-Functional: true
Test-tag: NvmeEnospace
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
Not sure I understand, as setting DEBUG mode once the test has failed will be too late.
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15559/7/execution/node/770/log
src/tests/ftest/nvme/enospace.py
Outdated
    "engine_pool_vos_aggregation_epr_duration_max",
    "engine_pool_vos_aggregation_epr_duration_mean",
    "engine_pool_vos_aggregation_epr_duration_min",
    # "engine_pool_vos_aggregation_epr_duration_samples",
Do we still need these commented out names?
I kept them as a reminder in case more stats are needed to debug the aggregation issue (if it appears again). However, I am pretty sure that they will not bring much more relevant information for solving it.
Thus, I will remove them.
- Remove useless commented metrics
Fixed with commit 115170b
Integrate reviewers comments:
- Update param type
- Remove useless commented metrics

Quick-Functional: true
Test-tag: NvmeEnospace
Required-githooks: true
Signed-off-by: Cedric Koch-Hofer <[email protected]>
115170b
…/daos-16557-v2 Quick-Functional: true Test-tag: NvmeEnospace Required-githooks: true
Issue Description
The functional test nvme/enospace.py is sporadically failing due to unexpected ENOSPACE errors, as outlined in the DAOS-16557 ticket.
Current Challenge
The INFO log level used in the functional test is not sufficiently verbose to determine why aggregation is not triggered when the test fails.
Potential workarounds have limitations:
Proposed Solution
This PR introduces telemetry logs of the rebuild state at each major step of the test to gain better insights into why aggregation occasionally fails. These telemetry logs aim to provide enough diagnostic information without overwhelming the CI with excessive log data.
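As a rough illustration of the approach (not the code of this PR), the sketch below logs a small table of aggregation metrics after a test step, with column widths computed from the content. Only the three metric names come from the review thread; `format_table`, `log_metrics`, and the `fetch` callback are hypothetical names chosen for this example, and a real test would plug in its own telemetry query.

```python
from typing import Callable, Dict, List, Sequence

# Metric names quoted in the review; everything else below is hypothetical.
AGGREGATION_METRICS = [
    "engine_pool_vos_aggregation_epr_duration_max",
    "engine_pool_vos_aggregation_epr_duration_mean",
    "engine_pool_vos_aggregation_epr_duration_min",
]


def format_table(header: Sequence[object], rows: Sequence[Sequence[object]]) -> str:
    """Render a text table whose column widths are computed from the cells."""
    cells = [[str(c) for c in header]] + [[str(c) for c in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(header))]

    def fmt(row: List[str]) -> str:
        # Pad every cell to its column width, then drop trailing spaces.
        return " | ".join(c.ljust(w) for c, w in zip(row, widths)).rstrip()

    separator = "-+-".join("-" * w for w in widths)
    return "\n".join([fmt(cells[0]), separator] + [fmt(r) for r in cells[1:]])


def log_metrics(
    step: str,
    fetch: Callable[[List[str]], Dict[str, object]],
    log: Callable[[str], None] = print,
) -> str:
    """Query the metrics through `fetch` and log them as one table."""
    values = fetch(AGGREGATION_METRICS)
    rows = [[name, values.get(name, "n/a")] for name in AGGREGATION_METRICS]
    table = format_table(["metric", "value"], rows)
    log(f"Aggregation metrics after step '{step}':\n{table}")
    return table


if __name__ == "__main__":
    # Stand-in for a real telemetry query: every metric reads as 0.
    log_metrics("pool created", lambda names: {name: 0 for name in names})
```

Calling `log_metrics` once per major step keeps the output to one compact table per step, which stays far smaller than a DEBUG-level server log.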