CMS@Home jobs not being cleaned up #10751
Actually, the problem has always been there, even before PhEDEx was superseded.
Hello again. Does anyone have the time to look at this at the moment? We've moved on a fair bit in the interim and are quite close to being "production ready".
@ivanreid Hi Ivan, apologies for the delay on this, but other issues that actually involve production services are taking precedence over this one. Can you help me understand and clarify this issue? Some of the questions I have are:
Looking into ireid_TC_SLC7_IDR_CMS_Home_230102_183019_9782, I see that:
@amaltaro That's OK Alan, I know there are more important issues.
Correct. The phedex-node is `phedex-node value="T3_CH_CMSAtHome"`.
Yes, precisely. We've never understood why the LogCollect for Merge GenSimFull files tries to run on T2_CH_CERN_P5, T2_CH_CERN, T2_CH_CERN_HLT. Presumably it's a cascading effect down the siteconfig chain. Cleanup for unmerged GenSimFull files fails because it tries to delete from EOS rather than DataBridge. We suspect that this is because we are the only site where the unmerged files and merged files are in different storage locations, but are puzzled that GenSimFullMerge gets the source location right while cleanup doesn't.
Thanks Ivan!
We need to check the logs, but I think the reason is the following: Merge jobs run at T3_CH_CMSAtHome, which uses and reports T2_CH_CERN as its phedex-node name (RSE). So the logArchive file created by the Merge job is theoretically available at T2_CH_CERN. By design, the agent does the right thing, which is sending a job to the sites associated with a given storage/RSE.
These Cleanup jobs are executed at T3_CH_CMSAtHome, because their input files are meant to be available at the "storage" T3_CH_CMSAtHome. Now the question is: why does a Merge job on T3_CH_CMSAtHome work, while a Cleanup job on T3_CH_CMSAtHome does not? They have exactly the same use case, both reading data in from the DataBridge(?). Maybe CMSSW uses the right path, read from the storage.xml, while the Cleanup jobs don't. That is just a guess though.
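The split described above is visible directly in the site-local-config layout: the local-stage-out section names a phedex-node, while reads are resolved via the event-data catalog. A minimal sketch of that split, assuming the XML fragment quoted at the end of this issue (the snippet below is illustrative, not the full file):

```python
# Sketch: the two places site-local-config.xml can point a job at storage.
# The fragment mirrors the T3_CH_CMSAtHome config quoted in this issue.
import xml.etree.ElementTree as ET

SITE_LOCAL_CONFIG = """
<site-local-config>
  <site name="T3_CH_CMSAtHome">
    <event-data>
      <catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=http"/>
    </event-data>
    <local-stage-out>
      <se-name value="srm-eoscms.cern.ch"/>
      <phedex-node value="T2_CH_CERN"/>
    </local-stage-out>
  </site>
</site-local-config>
"""

root = ET.fromstring(SITE_LOCAL_CONFIG)
pnn = root.find("./site/local-stage-out/phedex-node").get("value")
catalog = root.find("./site/event-data/catalog").get("url")

# A component that only consults local-stage-out sees T2_CH_CERN,
# while cmsRun resolves reads via the event-data catalog (DataBridge).
print(pnn)      # T2_CH_CERN
print(catalog)
```

So a module that only ever looks at local-stage-out would never learn about the DataBridge catalog, regardless of where the input files actually live.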
I once tried to follow the logic of CMSSW reading/parsing the siteconf files; it seemed to make a lot of hard-wired assumptions.
@amaltaro : It's been suggested that we look into this ourselves since the WMCore group is overloaded, and make suggestions as to possible solutions. I'm interested in why the Merge processes successfully read from DataBridge, but the CleanupUnmerged looks for the files on EOS instead. The code is fairly labyrinthine to the uninitiated, so could you please tell us which modules are responsible for the merging and the cleanup? Thanks.
@ivanreid apologies for the delay on getting back to this thread. I recompiled everything that we discussed so far to better understand this/these problems. Here is a fresh summary of the issue:
Looking into: ireid_TC_SLC7_IDR_CMS_Home_230102_183019_9782, we can observe the following:
I copied a recent log for a cleanup job from vocms0267; you can find it here. LogCollect jobs for MergeGenSimFull fail because their input data is available at the T2_CH_CERN storage, hence those jobs are submitted to the T2_CH_CERN computing sites. Given that this pool has no access to those resources, they sit there idle until they get killed by the agent for spending too long in Idle status (5 days). The module responsible for deleting files from the storage (Cleanup jobs) is DeleteMgr.py (WMCore/src/python/WMCore/Storage/DeleteMgr.py). Please let us know in case further assistance is needed. And thank you for looking into it!
@amaltaro Could you please also make available a log file from a successful merge job? Thanks. |
@ivanreid sure, it's now available under the same web area as mentioned above. |
My first impression on reading the log files is that DeleteMgr.py gets the PNN from the siteconf local-stage-out (lines 80-124), which we have pointed to T2_CH_CERN, but the merge jobs use the event-data instead (i.e. the DataBridge); see lines 189-207 in SiteLocalConfig.py. So the cleanup jobs are ignoring event-data.
I haven't been able to work out exactly where/how the event-data TFC is passed to the merge jobs (the Python code is rather more advanced than my PyROOT coding), but it obviously is, since the right files are merged. The question probably now becomes: should DeleteMgr just blithely assume that files to be deleted reside in LocalStageOut?
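For reference, the event-data TFC rule quoted at the end of this issue maps any /store LFN onto DataBridge. A minimal sketch replaying just that one rule (the sample LFN is made up; the real TFC machinery in CMSSW/WMCore handles chained rules and protocols):

```python
# Sketch of the lfn-to-pfn rule from the T3_CH_CMSAtHome storage.xml:
#   path-match="/+store/(.*)"
#   result="http://data-bridge.cern.ch/myfed/cms-output/store/$1"
import re

def lfn_to_pfn(lfn):
    """Resolve an LFN the way the quoted event-data TFC rule would."""
    return re.sub(r"^/+store/(.*)",
                  r"http://data-bridge.cern.ch/myfed/cms-output/store/\1",
                  lfn)

pfn = lfn_to_pfn("/store/unmerged/SomeDataset/file.root")
print(pfn)
# http://data-bridge.cern.ch/myfed/cms-output/store/unmerged/SomeDataset/file.root
```

Anything resolved through this rule lands on DataBridge; a delete that instead builds an EOS path from local-stage-out never consults it, which matches the observed behaviour.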
OK, so the merge jobs are actually CMSSW (cmsRun) jobs, so that might explain why they pick up the event-data TFC (verified by running some awk code over the logs). So it's probable that changing the cleanup jobs to use the event-data TFC might affect other sites as well. We could possibly put in a conditional section depending on the site name, but that seems inelegant to me.
For discussion, here's a possible but inelegant solution to our immediate problem. Treat it as pseudo-code, as I'm not sure if I've interpreted all the classes correctly, or if their members are available at the time. It also ignores possible problems downstream (e.g. when the unmerged mergeLog files in eoscms.cern.ch//eos/cms/store/unmerged/logs/prod/ are to be cleaned up).
[lxplus710:WMCore] > git diff
On the basis that "it's easier to ask for forgiveness than permission", Federica and I applied the suggested patch on our agent on vocms0267, and started a new workflow. However, my fears that I might have got class membership/hierarchy wrong have been borne out. The cleanup jobs are now failing with the error `'DeleteMgr' object has no attribute 'siteName'`.
@ivanreid thank you for this investigation so far. I am not saying this is the right way to resolve this issue, but if I follow the logic of the code, we can say: if there is no override, it ends up creating an instance of SiteLocalConfig, and that is the object returned to the DeleteMgr module. Inspecting the SiteLocalConfig object, we can see siteName here:
So, your if statement above should likely be replaced by:
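A minimal sketch of why the original condition raised the AttributeError and what the corrected form looks like. The Fake* classes are stand-ins for illustration, not WMCore code; the site-name check mirrors the "inelegant" conditional discussed above:

```python
# Hypothetical reconstruction of the suggested check (not the actual patch):
# the site name lives on the SiteLocalConfig object (self.siteCfg), not on
# DeleteMgr itself, which is what triggered the AttributeError above.

class FakeSiteLocalConfig:
    """Stand-in for WMCore's SiteLocalConfig, which carries siteName."""
    def __init__(self, siteName):
        self.siteName = siteName

class FakeDeleteMgr:
    """Stand-in for DeleteMgr; holds a siteCfg like the real module."""
    def __init__(self, siteCfg):
        self.siteCfg = siteCfg

    def use_event_data_tfc(self):
        # self.siteName does not exist on DeleteMgr -> AttributeError.
        # The working form of the condition goes through self.siteCfg:
        return self.siteCfg.siteName == "T3_CH_CMSAtHome"

mgr = FakeDeleteMgr(FakeSiteLocalConfig("T3_CH_CMSAtHome"))
print(mgr.use_event_data_tfc())  # True
```

The follow-up comments show that even with the attribute fixed, the deletes still targeted EOS, so the condition alone was not the whole story.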
This will not work, though, for scenarios where DeleteMgr is invoked with an override setup (WMCore/src/python/WMCore/Storage/DeleteMgr.py, line 126 in fea1de4). I do not know by heart what those are, but I am sure it's either used by specific tasks in central production, or by the T0 team.
@amaltaro Yes, I saw there was an override provision, but haven't tried to track down where it's used. I had already changed the if statement as you suggested; now we're back to trying to delete from EOS again. I've asked Federica to put more diagnostics into our patched DeleteMgr.py to try to better understand what is happening. I've also explicitly put T3_CH_CMSAtHome into our event-data path in the site-config rather than 'local', but that doesn't appear to be having any effect.
Unfortunately, the suggested patch does not work, because the variable |
Hi Alan, do you know if there is already a way to force the usage of the fallback information, instead of the info in localStageOut of the self.siteCfg object, for the delete command?
Dear Alan,
@fanzago Federica, thank you for this investigation. Even if we don't merge your contribution into master, it's good to have it in the format of a PR because we can easily patch the CMS@Home agent. For a more permanent fix, we can try to make it a priority for the upcoming quarter and have a closer look into it. We should get started with that planning in the coming days.
Impact of the bug
Cleanup from the CMS@Home jobs has not been done properly since PhEDEx was turned off.
Describe the bug
In an email from 12 August 2021, Ivan Reid describes the issue: cleanup jobs are trying to delete files from EOS instead of the DataBridge. Similarly, the merge log files have the same problem. The DataBridge storage does get garbage-collected after 30 days, but the space could fill up before 30 days is reached.
How to reproduce it
Steps to reproduce the behavior:
Expected behavior
Change the (hardcoded?) target of the cleanup jobs?
Additional context and error message
From Ivan's email:
Our current understanding is that in
`/cvmfs/cms.cern.ch/SITECONF/T3_CH_CMSAtHome/JobConfig/site-local-config.xml`
we have in stage-out:

```xml
<se-name value="srm-eoscms.cern.ch"/>
<phedex-node value="T2_CH_CERN"/>
```

while in event-data we have

```xml
<catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=http"/>
```

in which

```xml
<lfn-to-pfn protocol="http" path-match="/+store/(.*)"
            result="http://data-bridge.cern.ch/myfed/cms-output/store/$1"/>
```