Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSRuleCleaner deleting rules for rejected wfs that are being used by other workflows. #12164

Open
hassan11196 opened this issue Nov 12, 2024 · 3 comments
Assignees

Comments

@hassan11196
Copy link
Member

Impact of the bug
Inform which systems get affected by this bug. Which agent(s)? Which central service(s)?
MSRuleCleaner, MSTransferor, ReqMgr2

Describe the bug

There are three wfs concerning this situation in order of injection to RegMgr2.

wf1 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_121809_606

new: 2024-10-24 12:18:09 UTC 
assignment-approved: 2024-10-24 12:18:14 UTC
rejected: 2024-10-24 12:22:05 UTC 
rejected-archived: 2024-10-26 16:56:46 UTC

wf2 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_211828_4633

new: 2024-10-24 21:18:28 UTC 
assignment-approved: 2024-10-24 21:18:35 UTC 
assigned: 2024-10-24 22:02:27 UTC 
staging: 2024-10-24 22:04:03 UTC 
aborted: 2024-11-05 16:20:00 UTC

wf3 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_Prompt_GTValidation_RelVal_2024D_241024_212236_6565

new: 2024-10-24 21:22:36 UTC
assignment-approved: 2024-10-24 21:22:42 UTC 
assigned: 2024-10-24 22:03:10 UTC 
staging: 2024-10-24 22:04:06 UTC 
aborted: 2024-11-05 14:02:36 UTC 

Now Keep in mind that all these wfs require the same input blocks i.e 

[{'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#9d364486-0ad1-4207-812f-29f23249e197'},
{'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#4a41434d-4d35-4c12-b6c3-037e172642d8'}]

and that
wf1 was rejected 4 mins after injection, before it was even assigned, and therefore it never went to staging state and thus no ms-transferor entry or rule creation in rucio.

Now Nearly, 9 hours later wf2 is injected and shortly (~mins) after wf3 is also injected

The follwing rules are created for wf2 by ms-transferor 

['3353ed719bf446d38ccb739ba1076642', '6595ca5e42204e4c913e549e0ba24205']

transferor - 1st Create (This aligns with rucio block rules history I pasted above)

reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:02,376:INFO:MSTransferor: Creating rule for workflow pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_211828_4633 with 2 DIDs in container /JetMET0/Run2024D-v1/RAW, RSEs: T1_US_FNAL_Disk, grouping: ALL
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log:2024-10-24 22:04:02,484:INFO:MSTransferor: Rules successful created for /JetMET0/Run2024D-v1/RAW : ['3353ed719bf446d38ccb739ba1076642', '6595ca5e42204e4c913e549e0ba24205']

Now, MS-transferor processes wf3, but the rules are already created. So no new rules are created and MS-transferors keeps the same rule ids

transferor - 2nd Try Create 

reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,676:INFO:MSTransferor: Creating rule for workflow pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_Prompt_GTValidation_RelVal_2024D_241024_212236_6565 with 2 DIDs in container /JetMET0/Run2024D-v1/RAW, RSEs: T1_US_FNAL_Disk, grouping: ALL
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log:2024-10-24 22:04:05,740:WARNING:Rucio: Resolving duplicate rules and separating every DID in a new rule...
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,790:WARNING:Rucio: Found duplicate rule for account: wmcore_transferor
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-, rseExp: T1_US_FNAL_Disk
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-dids: [{'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#9d364486-0ad1-4207-812f-29f23249e197'}, {'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#4a41434d-4d35-4c12-b6c3-037e172642d8'}]
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,849:WARNING:Rucio: Found duplicate rule for account: wmcore_transferor
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-, rseExp: T1_US_FNAL_Disk 

Everything goes smoothly, wf2 and wf3 are staging, BUT
MS-Rulecleanor picks up rejected wf1 sees that it has rucio rules and it deletes them.

rulecleaner - 1st Delete

reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:27,830:DEBUG:MSRuleCleaner: Dispatching workflow: pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_121809_606 
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:28,189:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 6595ca5e42204e4c913e549e0ba24205 
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:28,217:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "DELETE /rules/6595ca5e42204e4c913e549e0ba24205 HTTP/1.1" 200 0
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log:2024-10-24 22:14:28,217:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 3353ed719bf446d38ccb739ba1076642 
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:28,236:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "DELETE /rules/3353ed719bf446d38ccb739ba1076642 HTTP/1.1" 200 0

The rest two deletes were for these 2 wfs, 
pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_092626_4354
pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_101534_7605
They also share the same input Blocks as above 3 wfs but difference is that they were placed at CERN.

rulecleaner - to Delete again 1 second later

reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log:2024-10-24 22:14:29,027:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 3353ed719bf446d38ccb739ba1076642 

rulecleaner - Delete again 3 second later again

reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log:2024-10-24 22:14:32,810:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 3353ed719bf446d38ccb739ba1076642  

Conclusion, wf1's rejection lead to rules deletion for wf1 and wf2 because ** MS-Rulecleanor does not check if the rule is being used by any other workflow or not.
I went through the code and found that it only checks the current wfs state, https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py#L323

How to reproduce it
Steps to reproduce the behavior:
submit 1 request requiring a certain block of data, reject it and resubmit another request requiring the same blocks in the timeframe where the oriignal rucio rules are not deleted.

Expected behavior
A clear and concise description of what you expected to happen.
MSRuleCleaner to not delete rules being used by active workflows.

Additional context and error message
Add any other context about the problem here, like error message and/or traceback. You might want to use triple back ticks to properly format it.
https://its.cern.ch/jira/browse/CMSPROD-227

@hassan11196
Copy link
Member Author

I will follow up with a pull requst for this issue.

@hassan11196 hassan11196 self-assigned this Nov 12, 2024
@hassan11196
Copy link
Member Author

Hi @amaltaro
Here are the two points we discussed in the WMCore Dev meeting:

  1. Instead of deleting the rule (which deletes the rule after 1 hour), Set the expiration time to expire in 0 seconds.
    In MS-Transferor, we can change it to only re-use rules without a expiration time

  2. Refresh the WMCache | WMStats Global locks more often. Due to a large amount of wfs in the system the duty cycle of MSRuleCleaner can be very long ~2 hours, resulting in inconsistencies.

@hassan11196
Copy link
Member Author

@haozturk has also created a issue in CMSRucio to discuss recreation of recently deleted rule due to the unique constraint

rucio/rucio#7168

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ToDo
Development

No branches or pull requests

1 participant