
CMS@Home jobs not being cleaned up #10751

Open
KatyEllis opened this issue Aug 13, 2021 · 21 comments · May be fixed by #11513

@KatyEllis

KatyEllis commented Aug 13, 2021

Impact of the bug
Cleanup of the CMS@Home job outputs has not been done properly since PhEDEx was turned off.

Describe the bug
In an email from Ivan Reid on 12 August 2021, he describes the issue: cleanup jobs are trying to delete files from EOS instead of from the DataBridge. The merge log files have the same problem. The DataBridge storage does get garbage collected after 30 days, but the space could fill up before the 30 days are reached.

How to reproduce it
Steps to reproduce the behavior:

Expected behavior
Change the (hardcoded?) target of the cleanup jobs?

Additional context and error message
From Ivan's email:

Our current understanding is that in
/cvmfs/cms.cern.ch/SITECONF/T3_CH_CMSAtHome/JobConfig/site-local-config.xml
we have in stage-out:

<se-name value="srm-eoscms.cern.ch"/>
<phedex-node value="T2_CH_CERN"/>

while in event-data we have

<catalog url="trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=http"/>

in which

<lfn-to-pfn protocol="http" path-match="/+store/(.*)"
            result="http://data-bridge.cern.ch/myfed/cms-output/store/$1"/>

@ivanreid

Actually, the problem has always been there, even before Phedex was superseded.
Failures can be verified at https://cmsweb-testbed.cern.ch/wmstats/index.html, filtering the workflow with "ireid_TC_SLC7".

@ivanreid

Hello again. Does anyone have the time to look at this at the moment? We've moved on a fair bit in the interim and are quite close to being "production ready".

@amaltaro

@ivanreid Hi Ivan, apologies for the delay on this, but other issues that actually involve production services are taking precedence over this one.

Can you help me understand and clarify this issue? Some of the questions I have are:

  1. I understand that Production/Processing jobs run on T3_CH_Volunteer resources, right? What is the phedex-node name defined in site-local-config.xml?
  2. all the remaining job types (tasks) run on T3_CH_CMSAtHome resources, right? And I see that the phedex-node name points to T2_CH_CERN. Is it correct?

Looking into: ireid_TC_SLC7_IDR_CMS_Home_230102_183019_9782

I see that:

  • LogCollect jobs for unmerged GenSimFull files are succeeding (thus reading the location provided by the T3_CH_Volunteer jobs).
  • LogCollect jobs for Merge GenSimFull files are NOT succeeding (thus reading the location provided by T3_CH_CMSAtHome). Jobs are submitted to T2_CH_CERN_P5, T2_CH_CERN, T2_CH_CERN_HLT.
  • Cleanup jobs for unmerged GenSimFull files are NOT succeeding (thus reading the location provided by T3_CH_Volunteer). Jobs are submitted to T3_CH_CMSAtHome.

@ivanreid

@amaltaro That's OK Alan, I know there are more important issues.

I understand that Production/Processing jobs run on T3_CH_Volunteer resources, right? What is the phedex-node name defined in site-local-config.xml?

Correct. The phedex-node is 'phedex-node value="T3_CH_CMSAtHome"'

all the remaining job types (tasks) run on T3_CH_CMSAtHome resources, right? And I see that the phedex-node name points to T2_CH_CERN. Is it correct?

Yes, precisely.

We've never understood why the LogCollect for Merge GenSimFull files tries to run on T2_CH_CERN_P5, T2_CH_CERN, T2_CH_CERN_HLT. Presumably it's a cascading effect down the siteconfig chain.

Cleanup for unmerged GenSimFull files fails because it tries to delete from EOS rather than DataBridge. We suspect that this is because we are the only site where the unmerged files and merged files are in different storage locations, but are puzzled that GenSimFullMerge gets the source location right while cleanup doesn't.

@amaltaro

Thanks Ivan!

We've never understood why the LogCollect for Merge GenSimFull files tries to run on T2_CH_CERN_P5, T2_CH_CERN, T2_CH_CERN_HLT. Presumably it's a cascading effect down the siteconfig chain.

We need to check the logs, but I think the reason is the following: Merge jobs run at T3_CH_CMSAtHome, which uses and reports T2_CH_CERN as its phedex-node name (RSE). So the logArchive file created by the Merge job is theoretically available at T2_CH_CERN. By design, the agent does the right thing, which is sending a job to the sites associated with a given storage/RSE.

Cleanup for unmerged GenSimFull files fails because it tries to delete from EOS rather than DataBridge. We suspect that this is because we are the only site where the unmerged files and merged files are in different storage locations, but are puzzled that GenSimFullMerge gets the source location right while cleanup doesn't.

These Cleanup jobs are executed at T3_CH_CMSAtHome, because their input files are meant to be available at the "storage" T3_CH_CMSAtHome.

Now the question is: why does a Merge job on T3_CH_CMSAtHome work, but a Cleanup job on T3_CH_CMSAtHome does not? They have exactly the same use case, both reading data in, presumably from the DataBridge. Maybe CMSSW uses the right path, read from the storage.xml, while the Cleanup jobs don't. That is just a guess though.

@ivanreid

I once tried to follow the logic of CMSSW reading/parsing the siteconf files -- it seemed to make a lot of hard-wired assumptions.

@ivanreid

@amaltaro : It's been suggested that we look into this ourselves since the WMCore group is overloaded, and make suggestions as to possible solutions. I'm interested in why the Merge processes successfully read from DataBridge, but the CleanupUnmerged looks for the files on EOS instead. The code is fairly labyrinthine to the uninitiated, so could you please tell us which modules are responsible for the merging and the cleanup? Thanks.

@amaltaro

amaltaro commented Feb 3, 2023

@ivanreid apologies for the delay on getting back to this thread.

I have gone back over everything we discussed so far to better understand these problems. Here is a fresh summary of the issue:

  1. Production/Processing jobs run on T3_CH_Volunteer resources, which has site-local-config configured to: phedex-node=T3_CH_CMSAtHome.
  2. All the remaining jobs/tasks run on T3_CH_CMSAtHome resources, which has site-local-config configured to: phedex-node=T2_CH_CERN.

Looking into: ireid_TC_SLC7_IDR_CMS_Home_230102_183019_9782, we can observe the following:

  • Merge jobs are succeeding (thus reading the location provided by the T3_CH_Volunteer jobs).
    • Jobs are submitted to T3_CH_CMSAtHome.
  • LogCollect jobs for unmerged GenSimFull files are succeeding (thus reading the location provided by the T3_CH_Volunteer jobs).
  • LogCollect jobs for Merge GenSimFull files are NOT succeeding (thus reading the location provided by T3_CH_CMSAtHome).
    • Jobs are submitted to T2_CH_CERN_P5, T2_CH_CERN, T2_CH_CERN_HLT.
    • They fail because they never get to run.
  • Cleanup jobs for unmerged GenSimFull files are NOT succeeding (thus reading the location provided by T3_CH_Volunteer).
    • Jobs are submitted to T3_CH_CMSAtHome.
    • These fail with "Message: Failed to clean up any files that were no longer needed (WMAgent)."

I copied a recent log for a cleanup job from vocms0267; you can find it here:
https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_10751/
I think the reason it fails is that it uses the Stage Out configuration (which points to T2_CH_CERN, since the job runs on a CMSAtHome resource) to actually delete files from the storage.
The command executed is:
"""
Command : xrdfs eoscms.cern.ch rm /eos/cms/store/unmerged/DMWM_Test/RelValSinglePiE50HCAL/GEN-SIM/TC_SLC7_CMS_Home_IDRv6a-v11/00007/9C44AA05-CE1F-564F-8043-4C784FE1B1D5.root
"""

LogCollect jobs for MergeGenSimFull fail because their input data is available at the T2_CH_CERN storage, hence the jobs are submitted to the T2_CH_CERN computing sites. Given that this pool has no access to those resources, they sit there idle until the agent kills them for spending too long in Idle status (5 days).

The module responsible for deleting files from the storage (Cleanup jobs) is:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/DeleteMgr.py

Please let us know in case further assistance is needed. And thank you for looking into it!

@ivanreid

ivanreid commented Feb 3, 2023

@amaltaro Could you please also make available a log file from a successful merge job? Thanks.

@amaltaro

amaltaro commented Feb 3, 2023

@ivanreid sure, it's now available under the same web area as mentioned above.
Note that the merge tarball has multiple condor* logs; the reason is that the job failed on its first try (901124) and then succeeded on its first retry (901129). The logfile wmagentJob.log is unique, though, and gets overwritten between job retries. HTH.

@ivanreid

ivanreid commented Feb 9, 2023

My first impression on reading the log files is that DeleteMgr.py gets the PNN from the siteconf local-stage-out (lines 80-124), which we have pointed at T2_CH_CERN, while the merge jobs use the event-data instead (i.e. the DataBridge), lines 189-207 in SiteLocalConfig.py. So the cleanup jobs are ignoring event-data.

@ivanreid

ivanreid commented Feb 10, 2023

I haven't been able to work out exactly where/how the event-data TFC is passed to the merge jobs (the Python code is rather more advanced than my PyROOT coding), but it obviously is, since the right files get merged. The question probably now becomes: should DeleteMgr just blithely assume that the files to be deleted reside in LocalStageOut?

@ivanreid

ivanreid commented Feb 13, 2023

OK, so the merge jobs are actually CMSSW (cmsRun) jobs, which might explain why they pick up the event-data TFC. Running the following awk code over /cvmfs/cms.cern.ch/SITECONF/T*/JobConfig/site-local-config.xml:

BEGIN { ed=0; lso=0; }
/site name/           { site=$0; }
/<event-data>/        { ed=1; ecat=" "; }
/<\/event-data>/      { ed=0; }
/<local-stage-out/    { lso=1; lcat=" "; }
/<\/local-stage-out>/ { lso=0;
                        if (ecat != lcat) {
                          print site " Catalogue files are different:";
                          print ecat; print lcat; print " ";
                        } }
(ed)  && /catalog/    { split($2,a,"?"); ecat=a[1]; }
(lso) && /catalog/    { split($2,a,"?"); lcat=a[1]; }

I find differences between the event-data TFC and the local-stage-out TFC for T0_CH_CSCS_HPC, T2_BE_UCL, T2_ES_IFCA, T3_CH_CERN_HelixNebula_REHA, T3_CH_CMSAtHome, T3_RU_SINP, T3_US_ANL, T3_US_Anvil, T3_US_Baylor, T3_US_Colorado, T3_US_Lancium, T3_US_NERSC, T3_US_NEU, T3_US_OSG, T3_US_PSC, T3_US_PuertoRico, T3_US_Rice, T3_US_Rutgers, T3_US_SDSC, T3_US_TACC, T3_US_UIowa, T3_US_UMD and T3_US_UMiss -- although some of these are due to my simplistic parsing. [Edit: Further investigation with grafana shows that most of the T3 sites are either not monitored, or only used for analysis jobs.]
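
For anyone who prefers Python, a rough equivalent of that check is the script below (a throwaway sketch, not WMCore code; the same simplistic-parsing caveat applies, and it only looks at the first catalog found in each section):

#!/usr/bin/env python3
# Report sites whose event-data and local-stage-out catalogs differ.
import glob
import xml.etree.ElementTree as ET

PATTERN = "/cvmfs/cms.cern.ch/SITECONF/T*/JobConfig/site-local-config.xml"

def first_catalog(section):
    """Return a section's catalog URL, stripped of its ?protocol=... suffix."""
    if section is None:
        return None
    cat = section.find(".//catalog")
    if cat is None:
        return None
    return cat.get("url", "").split("?")[0]

for path in sorted(glob.glob(PATTERN)):
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        continue
    site = root.find(".//site")
    name = site.get("name") if site is not None else path
    event_data = first_catalog(root.find(".//event-data"))
    stage_out = first_catalog(root.find(".//local-stage-out"))
    if event_data != stage_out:
        print("%s: catalogue files are different:\n  %s\n  %s"
              % (name, event_data, stage_out))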

So changing the cleanup jobs to use the event-data TFC would probably affect other sites as well. We could possibly put in a conditional section depending on the site name, but that seems inelegant to me.

@ivanreid

For discussion, here's a possible but inelegant solution to our immediate problem. Treat it as pseudo-code as I'm not sure if I've interpreted all the classes correctly, or if their members are available at the time. It also ignores possible problems downstream (e.g. when the unmerged mergeLog files in eoscms.cern.ch//eos/cms/store/unmerged/logs/prod/ are to be cleaned up).

[lxplus710:WMCore] > git diff
diff --git a/src/python/WMCore/Storage/DeleteMgr.py b/src/python/WMCore/Storage/DeleteMgr.py
index ff3df65..cc9451a 100644
--- a/src/python/WMCore/Storage/DeleteMgr.py
+++ b/src/python/WMCore/Storage/DeleteMgr.py
@@ -101,7 +101,11 @@ class DeleteMgr(object):
msg += "Unable to perform StageOut operation"
raise StageOutInitError(msg)
msg += "Local Stage Out PNN to be used is %s\n" % pnn
- catalog = self.siteCfg.localStageOut.get("catalog", None)
+
+ if self.siteName == "T3_CH_CMSAtHome":
+ catalog = self.siteCfg,eventData.get("catalog", None)
+ else:
+ catalog = self.siteCfg.localStageOut.get("catalog", None)
if catalog == None:
msg = "Unable to retrieve local stage out catalog\n"
msg += "From site config file.\n"

@ivanreid

On the basis that "it's easier to ask for forgiveness than permission", Federica and I applied the suggested patch on our agent on vocms0267, and started a new workflow. However, my fears that I might have got class membership/hierarchy wrong have been borne out. The cleanup jobs are now failing with the error "'DeleteMgr' object has no attribute 'siteName'".
Back to the drawing-board...

@amaltaro

@ivanreid thank you for this investigation so far. I am not saying this is the right way to resolve this issue, but if I follow the logic of the code, here is what we can say.

If there is no override, this code (or a few lines below) gets triggered:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/DeleteMgr.py#L72

which ends up creating an instance of SiteLocalConfig:

config = SiteLocalConfig(actualPath)

and that is the object returned to the DeleteMgr module. Inspecting the SiteLocalConfig object, we can see that it carries a siteName attribute.

So, your if statement above should likely be replaced by:

if self.siteCfg.siteName == "T3_CH_CMSAtHome":

This will not work though for scenarios where DeleteMgr is invoked with an override setup:

def initialiseOverride(self):

I do not know by heart what those overrides are, but I am sure they are used either by specific tasks in central production or by the T0 team.

@ivanreid

@amaltaro Yes, I saw there was an override provision, but haven't tried to track down where it's used. I had already changed the if statement as you suggested; now we're back to trying to delete from EOS again. I've asked Federica to put more diagnostics into our patched DeleteMgr.py to try to better understand what is happening. I've also explicitly put T3_CH_CMSAtHome into our event-data path in the site-config rather than 'local', but that doesn't appear to be having any effect.

@ivanreid

Unfortunately, the suggested patch does not work, because the variable catalog is not used! Instead, a call is made to self.siteCfg.trivialFileCatalog(), which explicitly returns the localStageOut catalog.

@fanzago

fanzago commented Feb 21, 2023

Hi Alan, do you know if there is already a way to force the use of the fallback information instead of the info in localStageOut of the self.siteCfg object for the delete command?
It seems that there is no override in our cleanup jobs. But if we are able to use the information provided by self.siteCfg.fallbackStageOut, we should be able to delete files from the DataBridge.
So the solution (to try) is to explicitly force the use of the fallback information.
Otherwise, I think we have to modify the initialiseSiteConf function of DeleteMgr.py to skip the self.tfc = self.siteCfg.trivialFileCatalog() call and provide the needed info in another way.
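
To make the retry idea concrete, something along these lines is what we have in mind (pure sketch; the resolver and delete helpers passed in are hypothetical placeholders, not existing WMCore calls):

# Sketch of the "retry with fallback" idea (hypothetical helpers, not WMCore API).
def delete_lfn(lfn, resolve_local, resolve_fallback, run_delete):
    """Try the delete via the local-stage-out PFN; on failure, retry via fallback."""
    try:
        # first attempt: PFN derived from the local-stage-out catalog (e.g. EOS)
        return run_delete(resolve_local(lfn))
    except Exception:
        # the file is not where local-stage-out says it is, so retry with the
        # PFN derived from the fallback / event-data catalog (e.g. DataBridge)
        return run_delete(resolve_fallback(lfn))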

@fanzago

fanzago commented Mar 13, 2023

Dear Alan,
the "patch" we have added in the DeleteMgr.py file of WMAgent, about the Cleanup step executed in T3_CH_CMSAtHome site, works well.
Currently we don't have failures in our workflows.
We "forced" the usage of the information provided by the site-local-config.xml instead of the usage of the info that are used as default in the cleanup step.
In order to do it, in the init of the DeleteMgr.py we check if the site name is T3_CH_CMSAtHome and in this case, we call the new initialiseCmsAtHome function instead of the original one initialiseSiteConf.
For sure this is not the best patch because it depends on the site name. But in order to avoid this patch and make the cleanup step working for all the sites, we think the logic of the DeleteMgr.py should be changed:

  • either the information from where to delete files is taken, for all the sites, from the part of the site-local-config.xml instead of
  • or the delete command, like the copy command, has to be retried using the fallback information, in the case the first attempt using the info fails.
    But these two other solutions need expert evaluation, we don't know how they can impact in other commands, steps, sites...
    We submit a Pull Request, if the code can be accepted as is....
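
For reference, this is roughly the shape of the dispatch (stub code only, not the real DeleteMgr; initialiseCmsAtHome is the name used in our local patch, and the attribute names mirror those quoted earlier in this thread):

# Stub-only sketch of the site-name dispatch described above. This is NOT the
# real WMCore DeleteMgr; it only illustrates where the patch hooks in.
class DeleteMgrSketch(object):
    def __init__(self, siteCfg):
        self.siteCfg = siteCfg
        if self.siteCfg.siteName == "T3_CH_CMSAtHome":
            # CMS@Home: unmerged files live on the DataBridge, so take the
            # delete catalog from event-data, as cmsRun does.
            self.initialiseCmsAtHome()
        else:
            # Default behaviour: delete via the local-stage-out catalog
            # (the real code goes through self.siteCfg.trivialFileCatalog()).
            self.initialiseSiteConf()

    def initialiseCmsAtHome(self):
        self.deleteCatalog = self.siteCfg.eventData.get("catalog", None)

    def initialiseSiteConf(self):
        self.deleteCatalog = self.siteCfg.localStageOut.get("catalog", None)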

@amaltaro

@fanzago Federica, thank you for this investigation.

Even if we don't merge your contribution into master, it's good to have it in the format of a PR because we can easily patch the CMS@Home agent.

For a more permanent fix, we can try to make it a priority for the upcoming quarter and have a closer look into it. We should get started with that planning in the coming days.
