BROKEN - Implementation of the token-safe retry logic for gfal #12191

Closed
wants to merge 25 commits into from

Contributor

@anpicci anpicci commented Nov 29, 2024

Fixes #12144

Status

not-tested

Description

This PR introduces the retry logic proposed by @stlammel to handle possible token-authentication failures when staging out with gfal-cp. It is still to be extended to the xrootd protocol.
More details are in the issue description.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@anpicci
Contributor Author

anpicci commented Nov 29, 2024

@khurtado FYI, we could start testing this on a testbed node set up for token authentication while we extend the implementation to the xrootd protocol.

    self.setups = "env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '{}'"
elif auth_method == "TOKEN":
    self.setups = "env -i BEARER_TOKEN=$(cat $BEARER_TOKEN_FILE) JOBSTARTDIR=$JOBSTARTDIR bash -c '{}'"
else:
Contributor Author

@amaltaro do we want to allow gfal-cp to run even when no authentication method is specified?
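
One way to make the question above concrete is to put the choice of command template in a small helper that rejects an unspecified method unless the caller opts in. This is only a sketch under assumptions: `build_setup`, `SETUP_TEMPLATES`, the `"X509"` key, and the `allow_unauthenticated` flag are hypothetical names, and the credential-free template is invented for illustration. Only the two template strings come from the diff.

```python
# Hypothetical helper, not the WMCore implementation. The two templates are
# copied from the diff; the "X509" key name is an assumption.
SETUP_TEMPLATES = {
    "X509": "env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '{}'",
    "TOKEN": "env -i BEARER_TOKEN=$(cat $BEARER_TOKEN_FILE) JOBSTARTDIR=$JOBSTARTDIR bash -c '{}'",
}


def build_setup(auth_method=None, allow_unauthenticated=False):
    """Return the command template for the requested authentication method."""
    if auth_method in SETUP_TEMPLATES:
        return SETUP_TEMPLATES[auth_method]
    if allow_unauthenticated:
        # Assumption: a clean environment that carries no credentials at all.
        return "env -i JOBSTARTDIR=$JOBSTARTDIR bash -c '{}'"
    raise ValueError("Unsupported authentication method: {!r}".format(auth_method))
```

With this shape, running without credentials becomes an explicit decision at the call site rather than a silent `else` branch.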

except StageOutError as ex:
-    msg = "Attempt {} to stage out failed.\n".format(retryCount)
+    msg = "Attempt {} to stage out failed with default setup.\n".format(retryCount)
Contributor Author

@amaltaro @stlammel by default, do we want to set the BEARER_TOKEN environment variable to force an attempt with token authentication?
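
For discussion, the kind of retry being asked about could look like the following minimal sketch. It is not the logic from this PR or from the linked issue: `stage_out_with_fallback`, the setup names, and the local `StageOutError` stand-in are all hypothetical, and the order in which setups are tried is an assumption.

```python
class StageOutError(Exception):
    """Local stand-in for the WMCore exception of the same name."""


def stage_out_with_fallback(transfer, setups, max_attempts=3):
    """Try each setup in order on every attempt; return the setup that worked.

    transfer: callable taking a setup name and raising StageOutError on failure.
    setups:   ordered list of setup names, with the default one first.
    """
    errors = []
    for retry_count in range(1, max_attempts + 1):
        for setup in setups:
            try:
                transfer(setup)
                return setup
            except StageOutError as ex:
                # Record which setup failed so the final error is traceable.
                errors.append(
                    "Attempt {} to stage out failed with {} setup: {}".format(
                        retry_count, setup, ex))
    raise StageOutError("\n".join(errors))
```

Putting the token setup first in `setups` would correspond to forcing token authentication by default. Putting it second keeps the current behaviour and only falls back to the token.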

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 27 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/112/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 27 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/113/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 2, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 29 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/115/artifact/artifacts/PullRequestReport.html

@anpicci
Contributor Author

anpicci commented Dec 2, 2024

test this

@amaltaro
Contributor

amaltaro commented Dec 2, 2024

test this please

@dmwm-bot

dmwm-bot commented Dec 2, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 80 comments to review
  • Pycodestyle check: succeeded
    • 6 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/118/artifact/artifacts/PullRequestReport.html

@anpicci
Contributor Author

anpicci commented Dec 2, 2024

@amaltaro I am not sure that the newly failing unit test is caused by my changes.

@dmwm-bot

dmwm-bot commented Dec 2, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 80 comments to review
  • Pycodestyle check: succeeded
    • 6 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/117/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

amaltaro commented Dec 2, 2024

These are the newly failing unit tests:

    WMCore_t.Services_t.DBS_t.DBSConcurrency_t.DBSConcurrencyTest:testGetBlockInfoList changed from success to error
    WMCore_t.Storage_t.Backends_t.GFAL2Impl_t.GFAL2ImplTest:testInit changed from success to failure

which I am not able to find under the Test Result section in
https://cmssdt.cern.ch/dmwm-jenkins/job/WMCore-PR-Report/118/#showFailuresLink

@d-ylee @khurtado am I doing something wrong? How do I get to the details of the failing unit tests? Do you understand why the 2 reported error/failures don't show up in the list of 51 failing unit tests?

@d-ylee
Contributor

d-ylee commented Dec 3, 2024

@amaltaro This is interesting. The error shows up in 117, but not in 118. https://cmssdt.cern.ch/dmwm-jenkins/job/WMCore-PR-Report/117/#showFailuresLink

Looking at the GitHub comment history, it looks like both you and @anpicci asked Jenkins to do the test and also made a new commit at around the same time, so I am assuming 117 and 118 are from both of your comments.

@amaltaro
Contributor

amaltaro commented Dec 3, 2024

test this please

@dmwm-bot

dmwm-bot commented Dec 3, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 24 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 80 comments to review
  • Pycodestyle check: succeeded
    • 6 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/119/artifact/artifacts/PullRequestReport.html

@anpicci
Contributor Author

anpicci commented Dec 3, 2024

@amaltaro @d-ylee @khurtado FYI, my last commit fixes the error affecting WMCore_t.Storage_t.Backends_t.GFAL2Impl_t.GFAL2ImplTest:testInit. I reproduced and fixed it by running the unit test in a local environment.
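
For anyone who wants to reproduce a single Jenkins-reported failure locally, a sketch like the one below may help. It assumes the test package (here `WMCore_t`) is already importable in the local environment. The helper names `jenkins_to_unittest_name` and `run_single_test` are hypothetical and not part of WMCore or Jenkins.

```python
import unittest


def jenkins_to_unittest_name(jenkins_name):
    """Turn 'pkg.module.Class:method' into 'pkg.module.Class.method'."""
    return jenkins_name.replace(":", ".")


def run_single_test(jenkins_name):
    """Load and run one test case; the test package must be importable."""
    dotted = jenkins_to_unittest_name(jenkins_name)
    suite = unittest.defaultTestLoader.loadTestsFromName(dotted)
    return unittest.TextTestRunner(verbosity=2).run(suite)
```

For example, `run_single_test("WMCore_t.Storage_t.Backends_t.GFAL2Impl_t.GFAL2ImplTest:testInit")` would run only the test discussed above.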

@d-ylee
Contributor

d-ylee commented Dec 5, 2024

retest this please

@dmwm-bot

dmwm-bot commented Dec 5, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 80 comments to review
  • Pycodestyle check: succeeded
    • 6 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/148/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 7 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 80 comments to review
  • Pycodestyle check: succeeded
    • 6 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/200/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 133 comments to review
  • Pycodestyle check: succeeded
    • 6 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/202/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 133 comments to review
  • Pycodestyle check: succeeded
    • 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/204/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 7 new failures
  • Python3 Pylint check: succeeded
    • 133 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/207/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 134 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/208/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 1 warnings
    • 134 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/209/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 134 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/210/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 134 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/211/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 134 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/217/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 warnings
    • 134 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/220/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Jan 5, 2025

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 5 warnings
    • 183 comments to review
  • Pycodestyle check: succeeded
    • 26 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/247/artifact/artifacts/PullRequestReport.html

@anpicci changed the title from "Implementation of the token-safe retry logic for gfal" to "BROKEN - Implementation of the token-safe retry logic for gfal" on Jan 5, 2025
@anpicci
Contributor Author

anpicci commented Jan 5, 2025

I'm closing this PR in favor of #12218, to avoid conflicting changes with Kenyi's PR.

FYI @amaltaro @khurtado

@anpicci closed this on Jan 5, 2025
Development

Successfully merging this pull request may close these issues.

Adopt token for WMAgent stage-in/stage-out
5 participants