WMCore debugging tools
This wiki is meant to list debugging use cases, either to solve/debug Operations issues or internal Dev ones.
Problem: Ops asks us to check why the workflow has not processed 100% of the lumi sections, even though all the failures have been recovered via ACDCs.
Solution: first we need to make sure that ACDCs have been created AND executed for every single task path (fileset_name, in terms of ACDC collection).
Details: what we need to retrieve/check is:
- did the ACDCs get created after the initial/original workflow moved to completed status?
- list the number of jobs/lumis in each fileset_name, from the ACDC collection
- query reqmgr2 for ACDC workflows recovering that workflow (and fetch their InitialTaskPath)
- make sure that those ACDC workflows are in completed status
- anything else
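The cross-check between the ACDC collection and the ACDC workflows can be sketched as below. This is only an illustration: it assumes the fileset_name values and the ACDC workflow documents have already been fetched, and that each document is a plain dict carrying RequestName, InitialTaskPath and RequestStatus. The function name check_acdc_coverage and the done_statuses parameter are hypothetical and not part of WMCore.

```python
from collections import defaultdict


def check_acdc_coverage(fileset_names, acdc_workflows, done_statuses=("completed",)):
    """Map each fileset_name to the ACDC workflows recovering it.

    fileset_names: task paths found in the ACDC collection (assumed input).
    acdc_workflows: list of dicts with RequestName, InitialTaskPath and
        RequestStatus keys (assumed shape of already-fetched documents).
    """
    # Group the ACDC workflows by the task path they recover
    by_task = defaultdict(list)
    for workflow in acdc_workflows:
        by_task[workflow["InitialTaskPath"]].append(workflow)

    report = {}
    for name in fileset_names:
        recoveries = by_task.get(name, [])
        report[name] = {
            "acdc_workflows": [wf["RequestName"] for wf in recoveries],
            # False when no ACDC exists for the task path, or when any of
            # them has not reached one of the accepted statuses
            "all_completed": bool(recoveries)
            and all(wf["RequestStatus"] in done_statuses for wf in recoveries),
        }
    return report
```

Any fileset_name reported with an empty acdc_workflows list, or with all_completed set to False, is a task path that still needs attention.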
Problem: Ops asks us to investigate why the output datasets are missing statistics, even though there are no job failures reported (or they have all been recovered).
Solution: not necessarily a solution. However, part of the solution above has to be applied here, so check whether all lumis have been recovered. In addition to that, we could have a tool that takes a workflow as input, finds all the run/lumis meant to be processed, randomly selects one output dataset and compares it against the input dataset. Finally, it yields a list of run/lumis missing in the output dataset.
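The comparison step of such a tool could look like the sketch below. It assumes the run/lumi content of the input and of the chosen output dataset has already been retrieved into dicts mapping a run number to an iterable of lumi numbers; the function name missing_lumis is hypothetical, and how the run/lumis are fetched is left out on purpose.

```python
def missing_lumis(expected, produced):
    """Return {run: sorted lumis} found in expected but not in produced.

    Both arguments map a run number to an iterable of lumi numbers
    (assumed shape, for illustration only).
    """
    missing = {}
    for run, lumis in expected.items():
        # A run entirely absent from the output counts as fully missing
        absent = set(lumis) - set(produced.get(run, ()))
        if absent:
            missing[run] = sorted(absent)
    return missing


if __name__ == "__main__":
    input_lumis = {1: [1, 2, 3], 2: [10, 11]}
    output_lumis = {1: [1, 3]}
    print(missing_lumis(input_lumis, output_lumis))  # {1: [2], 2: [10, 11]}
```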
Problem: When we are completing the agent draining procedure, there are some rare cases where subscriptions are stuck in an unfinished state (finished=0). It also usually means that there is at least one GQ workqueue element in Running state (and its equivalent LQ workqueue/workqueue_inbox element).
Solution: there are many possible reasons for having a subscription stuck, so there is no common solution. Among the checks we can perform are: correlate the subscription to its fileset and workflow task; check whether they have files either in the available or acquired tables.
Details: further details can be found in this GitHub issue: https://github.com/dmwm/WMCore/issues/9568
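The checks mentioned in the solution above (correlating an unfinished subscription with its fileset and workflow task, then counting its available and acquired files) might be expressed as a single query, sketched below. The table and column names (wmbs_subscription, wmbs_fileset, wmbs_workflow, wmbs_sub_files_available, wmbs_sub_files_acquired) are assumptions about the agent's WMBS schema and should be verified against the actual database before use. The helper stuck_subscriptions is hypothetical and only expects a DB-API connection.

```python
import sqlite3  # used only to try the query locally; any DB-API connection works

# Assumed WMBS table/column names; verify against the agent database schema.
STUCK_SUBSCRIPTIONS_SQL = """
SELECT s.id,
       f.name,
       w.name,
       w.task,
       (SELECT COUNT(*) FROM wmbs_sub_files_available a
         WHERE a.subscription = s.id),
       (SELECT COUNT(*) FROM wmbs_sub_files_acquired q
         WHERE q.subscription = s.id)
  FROM wmbs_subscription s
  JOIN wmbs_fileset f ON f.id = s.fileset
  JOIN wmbs_workflow w ON w.id = s.workflow
 WHERE s.finished = 0
 ORDER BY s.id
"""


def stuck_subscriptions(conn):
    """Return one row per unfinished subscription:
    (subscription id, fileset name, workflow name, task path,
     number of available files, number of acquired files)."""
    cursor = conn.cursor()
    cursor.execute(STUCK_SUBSCRIPTIONS_SQL)
    return cursor.fetchall()
```

A non-zero count in either of the last two columns points at files that were never picked up, or never released, by the subscription.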