
WMCore debugging tools

Alan Malta Rodrigues edited this page Mar 19, 2020 · 11 revisions

This wiki lists debugging use cases, either for solving/debugging Operations issues or for internal Dev ones.

[Ops] Debug whether all jobs have been recovered via ACDCs

Problem: Ops asks us to check why a workflow hasn't processed 100% of its lumi sections, even though all the failures have been recovered via ACDCs.

Solution: first, we need to make sure that ACDCs have been created AND executed for every single task path (fileset_name, in ACDC collection terms).

Details: what we need to retrieve/check is:

  • did the ACDCs get created after the initial/original workflow moved to completed status?
  • list the number of jobs/lumis in each fileset_name, from the ACDC collection
  • query reqmgr2 for ACDC workflows recovering that workflow (and fetch their InitialTaskPath)
  • make sure that those ACDC workflows are in completed status
  • anything else
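The cross-check between the ACDC collection and the ACDC workflows can be sketched as below. This is only an illustration: it assumes the job counts per fileset_name and the ACDC workflow documents have already been fetched (from the ACDC collection and from reqmgr2, respectively) and are passed in as plain Python structures. The function name `uncovered_task_paths` and the `done_statuses` parameter are hypothetical. The sketch does not cover the first bullet (whether the ACDC was created after the original workflow moved to completed).

```python
from collections import defaultdict


def uncovered_task_paths(fileset_jobs, acdc_workflows, done_statuses=("completed",)):
    """Report task paths whose failures do not look fully recovered.

    fileset_jobs:   {fileset_name: number of jobs left in the ACDC collection}
    acdc_workflows: list of dicts with "InitialTaskPath" and "RequestStatus"
                    (assumed to be already retrieved from reqmgr2)
    done_statuses:  statuses considered "executed" (hypothetical parameter)
    """
    # Group the ACDC workflow statuses by the task path they recover
    statuses_by_path = defaultdict(list)
    for wf in acdc_workflows:
        statuses_by_path[wf["InitialTaskPath"]].append(wf["RequestStatus"])

    problems = {}
    for fileset_name, njobs in fileset_jobs.items():
        if njobs == 0:
            continue  # nothing to recover for this task path
        statuses = statuses_by_path.get(fileset_name, [])
        if not statuses:
            problems[fileset_name] = "no ACDC workflow created"
            continue
        pending = sorted({s for s in statuses if s not in done_statuses})
        if pending:
            problems[fileset_name] = "ACDC not completed: " + ", ".join(pending)
    return problems
```

An empty result means every task path with recoverable jobs has at least one ACDC workflow, and all of those are in one of the `done_statuses`.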

[Ops] Find out which run/lumi is missing in the output dataset

Problem: Ops asks us to investigate why the output datasets are missing statistics, even though there are no job failures reported (or they have all been recovered).

Solution: not necessarily a solution. However, part of the solution above applies here as well: check whether all lumis have been recovered. In addition, we could have a tool that takes a workflow as input, finds all the run/lumis meant to be processed, randomly selects one output dataset and compares it against the input dataset, finally yielding a list of the run/lumis missing in the output dataset.
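The comparison step of such a tool could look like the sketch below. It assumes the run/lumi lists of the input and output datasets have already been retrieved and are given as dictionaries mapping a run number to its lumi sections; the function name `missing_lumis` is hypothetical, and fetching the lumis and picking the output dataset are left out.

```python
def missing_lumis(input_lumis, output_lumis):
    """Return {run: sorted list of lumis} present in the input but not in the output.

    Both arguments map a run number to an iterable of lumi section numbers.
    """
    missing = {}
    for run, lumis in input_lumis.items():
        # Lumis expected for this run that never showed up in the output
        diff = set(lumis) - set(output_lumis.get(run, ()))
        if diff:
            missing[run] = sorted(diff)
    return missing
```

A run that is entirely absent from the output dataset is reported with all of its input lumis.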

[Dev] Debugging subscriptions not finished

Problem: when completing the agent draining procedure, there are rare cases where subscriptions are stuck in an unfinished state (finished=0). This usually also means that at least one GQ workqueue element is in Running state (along with its equivalent LQ workqueue/workqueue_inbox element).

Solution: there are many possible reasons for having a subscription stuck, so there is no common solution. Among the checks we can perform are: correlate the subscription to its fileset and workflow task; check whether they have files either in the available or acquired tables.
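As a starting point, the correlation described above can be expressed as a single query. The sketch below runs it against an in-memory SQLite database with a heavily simplified schema; the table and column names are assumptions modeled on WMBS and must be checked against the actual agent database before reuse. `DEMO_SCHEMA` and `unfinished_subscriptions` are hypothetical names used only for this illustration.

```python
import sqlite3

# Simplified stand-in for the WMBS tables (assumed names/columns, not the real schema)
DEMO_SCHEMA = """
CREATE TABLE wmbs_fileset (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_workflow (id INTEGER PRIMARY KEY, name TEXT, task TEXT);
CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, fileset INTEGER,
                                workflow INTEGER, finished INTEGER);
CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
"""

# For each unfinished subscription: workflow, task, fileset and the number of
# files still sitting in the available and acquired tables
QUERY = """
SELECT s.id, w.name, w.task, f.name,
       (SELECT COUNT(*) FROM wmbs_sub_files_available a WHERE a.subscription = s.id),
       (SELECT COUNT(*) FROM wmbs_sub_files_acquired q WHERE q.subscription = s.id)
FROM wmbs_subscription s
JOIN wmbs_fileset f ON f.id = s.fileset
JOIN wmbs_workflow w ON w.id = s.workflow
WHERE s.finished = 0
ORDER BY s.id
"""


def unfinished_subscriptions(conn):
    """Return one row per subscription with finished=0."""
    return conn.execute(QUERY).fetchall()
```

To try it out, create a connection with `sqlite3.connect(":memory:")`, load `DEMO_SCHEMA` via `executescript`, insert a few rows and call `unfinished_subscriptions(conn)`. Non-zero counts in the last two columns point at files that were never picked up or never completed.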

Details: further details can be found in this GitHub issue: https://github.com/dmwm/WMCore/issues/9568
