Adopt token for WMAgent stage-in/stage-out #12144

Open
amaltaro opened this issue Oct 14, 2024 · 10 comments · May be fixed by #12196 or #12218

Comments

@amaltaro
Contributor

amaltaro commented Oct 14, 2024

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
Similar to ticket #11199, we need to adopt tokens for the WMAgent payload. In other words, instead of using X509-based auth/authz for stage-in and stage-out, we should adopt a token-based solution for this storage communication.

Describe the solution you'd like
Support tokens in WMAgent for stage-in / stage-out.

Tokens in the grid jobs will only be available once we configure
a) access to token in the agent node;
b) management of the token in the agent node;
c) propagation of the token by htcondor and WMAgent job description;
d) use of token by the grid job (stage-in / stage-out).
Until all of this is in place, production jobs should not access tokens at runtime.

As a result, that requires at least the following developments:

  • setup of HTCondor to propagate the relevant token to the job condor shadow
  • update SimpleCondorPlugin to define token in the job classad
  • have the bearer token defined in the job environment (to be picked up by CMSSW for stage-in, and read by the stage-out code)
  • then, improve the debugging information with the token-relevant information
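
As an illustration of the third item, here is a minimal sketch of how stage-out code might locate a token in the job environment. It is modelled on the lookup order of the WLCG Bearer Token Discovery convention (`BEARER_TOKEN`, then `BEARER_TOKEN_FILE`, then `$XDG_RUNTIME_DIR/bt_u<uid>`, then `/tmp/bt_u<uid>`). The function name `discover_bearer_token` is hypothetical and not part of WMCore, and falling through to the next location when a file is unreadable is a simplification assumed here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def discover_bearer_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first bearer token found, or None (hypothetical helper)."""
    env = os.environ if env is None else env

    # 1. The token itself is in the environment.
    token = env.get("BEARER_TOKEN", "").strip()
    if token:
        return token

    # 2-4. Candidate files, in order of preference.
    candidates = []
    if env.get("BEARER_TOKEN_FILE"):
        candidates.append(Path(env["BEARER_TOKEN_FILE"]))
    uid = os.getuid()
    if env.get("XDG_RUNTIME_DIR"):
        candidates.append(Path(env["XDG_RUNTIME_DIR"]) / f"bt_u{uid}")
    candidates.append(Path("/tmp") / f"bt_u{uid}")

    for path in candidates:
        try:
            token = path.read_text().strip()
        except OSError:
            # Assumption: an unreadable file is skipped rather than fatal.
            continue
        if token:
            return token
    return None
```

Passing the environment as a mapping keeps the lookup easy to exercise outside a real grid job.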

Describe alternatives you've considered
If token-based auth/authz fails, do we want to fall back to X509?

Additional context
None

@anpicci
Contributor

anpicci commented Oct 16, 2024

@amaltaro I took the liberty of updating the description of the issue, according to the discussion in #12081.
@stlammel, feel free to provide additional comments and suggestions here, rather than in the PR linked above, at your convenience.

@stlammel

I think we want to make the stage-out token-safe now, i.e. if a token is in the environment and the token-based transfer doesn't work, the stage-out doesn't fail. (Right now, if HTCondor makes a token available but the token-based transfer doesn't work, stage-out may fail.)
Something like:

print date/time
print hostname
print GFAL*, PYTHON*, and LD_* environment
print gfal-copy location
print PFNs

gfal-copy ...
if ( rc == 0 ) done

if ( X509_USER_PROXY is set )
    sleep 3 sec
    in subprocess {
        unsetenv BEARER_TOKEN
        unsetenv BEARER_TOKEN_FILE
        gfal-copy -v ...
        if ( rc == 0 ) done
    }
if ( BEARER_TOKEN or BEARER_TOKEN_FILE is set )
    sleep 3 sec
    in subprocess {
        unsetenv X509_USER_PROXY
        gfal-copy -v ...
        if ( rc == 0 ) done
    }

sleep 15 min
print date/time
voms-proxy-info -all
httokendecode
gfal-copy -vvv ...

(token information may change underneath us, so we should print it just before the debug gfal-copy).
Just a thought/suggestion.
Thanks,

  • Stephan
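
The fallback sequence in the pseudocode above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `stage_out` and `run_copy` are hypothetical names, not existing WMCore functions; the source and destination PFNs are passed straight to `gfal-copy`; and the diagnostic prints and the final 15-minute debug attempt are left out for brevity. Each retry runs in a child process with a filtered copy of the environment, so the parent job environment is never modified.

```python
import os
import subprocess
import time
from typing import Callable, List, Mapping, Optional

TOKEN_VARS = ("BEARER_TOKEN", "BEARER_TOKEN_FILE")
PROXY_VAR = "X509_USER_PROXY"


def run_copy(cmd: List[str], env: Mapping[str, str]) -> int:
    """Run one copy attempt in a child process with its own environment."""
    return subprocess.run(cmd, env=dict(env), check=False).returncode


def stage_out(src: str, dst: str,
              env: Optional[Mapping[str, str]] = None,
              runner: Callable[[List[str], Mapping[str, str]], int] = run_copy,
              pause: float = 3.0) -> bool:
    """Try the copy with all credentials, then X509 only, then token only."""
    env = dict(os.environ if env is None else env)
    plain = ["gfal-copy", src, dst]
    verbose = ["gfal-copy", "-v", src, dst]

    # Attempt 1: whatever credentials the environment offers.
    if runner(plain, env) == 0:
        return True

    # Attempt 2: X509 only, with the token variables hidden from the child.
    if env.get(PROXY_VAR):
        time.sleep(pause)
        x509_env = {k: v for k, v in env.items() if k not in TOKEN_VARS}
        if runner(verbose, x509_env) == 0:
            return True

    # Attempt 3: token only, with the proxy variable hidden from the child.
    if any(env.get(name) for name in TOKEN_VARS):
        time.sleep(pause)
        token_env = {k: v for k, v in env.items() if k != PROXY_VAR}
        if runner(verbose, token_env) == 0:
            return True

    return False
```

The `runner` parameter exists only so the retry logic can be exercised without a real `gfal-copy` binary or storage endpoint.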

@amaltaro
Contributor Author

Right! We will add the necessary safety mechanism once we integrate tokens in the grid jobs. Thanks Stephan.

@anpicci anpicci self-assigned this Oct 22, 2024
@anpicci anpicci moved this from ToDo to In Progress in WMCore quarterly developments Oct 22, 2024
@khurtado khurtado self-assigned this Nov 18, 2024
@khurtado khurtado linked a pull request Dec 5, 2024 that will close this issue
@khurtado
Contributor

khurtado commented Dec 5, 2024

Issues I have noticed so far:

  • When jobs are submitted via the condor python bindings, inside the WMAgent container, the job is submitted, but /usr/bin/condor_vault_storer does not seem to be triggered.

I am currently working around that by:

  1. Executing a condor_submit with a test job once from the host
    Token seems to stay and refresh afterwards.

If cms_readonly scope is used, this does not happen and we need a manual refresh, but for production, we don't need any scope.

@anpicci
Contributor

anpicci commented Dec 6, 2024

maybe @stlammel can comment more on what @khurtado reported

@stlammel

stlammel commented Dec 6, 2024

We have not used the python binding in the initial tests. If there is a bug, we'll need to make a test case and submit a bug report.

  • Stephan

@amaltaro
Contributor Author

amaltaro commented Dec 6, 2024

We have not used the python binding in the initial tests. If there is a bug, we'll need to make a test case and submit a bug report.

I am not sure I understand this comment.
When we were testing these functionalities in a Fermilab agent, we did test it with a workflow going through the WM system (testbed), which also includes job submission by the agent (hence using python bindings).

It is important to note, though, that we were using a CC7 node with an RPM-based deployment. Maybe this is the distinction you are trying to make here? Is the Docker solution imposing limitations on this?

@stlammel

stlammel commented Dec 6, 2024

I didn't know/realize this, Alan. I was only aware of the condor_submit test. Then this is likely a config issue and not a bug.
Thanks,

  • Stephan

@khurtado
Contributor

khurtado commented Dec 6, 2024

@amaltaro Submission through the python bindings works as long as there is a token present already. This can be achieved by creating the token by hand, or by having a client that does this for you (like the host condor_submit).

If the tests at FNAL started with a condor_submit first, and a job submission via WMAgent after, then this issue would not have shown up at all.

The python bindings alone don't seem to invoke any token generation script, but that is probably not a huge deal because the first job usually prompts you to an oauth website for the authentication, and since we submit the jobs automatically via JobSubmitter, we do need to do this interactively beforehand anyway.

I am just noting that as things are now, we need to generate a token at the host, either by hand or with a simple condor_submit test job, in order for WMAgent to submit and use the tokens later on.

I don't think we need to dig too much into this at this point though, because this part of the token generation is going to change eventually, so that this is done without an interactive URL that the user needs to click and authenticate on.
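
Since the agent currently depends on a token already existing at the host, a small pre-flight check before automated submission might help catch a missing or expired token early. The sketch below assumes the token is a JWT (as WLCG tokens are) and only decodes the payload without verifying the signature; `token_claims` and `seconds_left` are hypothetical helper names, and where the token is stored on the agent node is not specified here.

```python
import base64
import json
import time
from typing import Optional


def token_claims(token: str) -> dict:
    """Decode the (unverified) payload of a JWT into a dict of claims."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_left(token: str, now: Optional[float] = None) -> float:
    """Seconds until the token's 'exp' claim; negative if already expired."""
    now = time.time() if now is None else now
    return float(token_claims(token)["exp"]) - now
```

A caller could refuse to submit, or log a warning, when `seconds_left` drops below some threshold of its choosing.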

@stlammel

stlammel commented Dec 9, 2024

I thought about Kenyi's message about condor_vault_storer not being executed on a submit via the python API. The initial authorization forwarding requires interactive execution and the device code URL might be lost on stdin in case of the python API.
So it might be called/executed, but the prompt for user action is not getting through or is being lost.
Thanks,

  • Stephan
