Draining Steps

Alan Malta Rodrigues edited this page Nov 28, 2017 · 16 revisions
  1. Tell WorkQueueManager not to accept any more work blocks.
Open the config file at:
    /data/srv/wmagent/current/config/wmagent/config.py

Look for the WorkQueueManager queue parameters:
    config.WorkQueueManager.queueParams = {'ParentQueueCouchUrl': 'https://cmsweb.cern.ch/couchdb/workqueue'}

Add 'DrainMode': True to the parameters:
    config.WorkQueueManager.queueParams = {'DrainMode': True, 'ParentQueueCouchUrl': 'https://cmsweb.cern.ch/couchdb/workqueue'}

Save the changes.

Restart WorkQueueManager, AnalyticsDataCollector and AgentStatusWatcher. (The code could be changed to pick up the config change automatically, see #7994.)
    $manage execute-agent wmcoreD --restart --components=WorkQueueManager,AnalyticsDataCollector,AgentStatusWatcher

Undo these changes to take the agent out of drain.
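The same change can be expressed as a dictionary update instead of retyping the whole line. The following is a minimal sketch, not part of the agent code: `set_drain_mode` is a hypothetical helper, and `SimpleNamespace` only stands in for the real configuration object so that the snippet runs on its own.

```python
from types import SimpleNamespace

# Stand-in for the agent configuration object (assumption: the real
# config.py exposes config.WorkQueueManager.queueParams as a plain dict).
config = SimpleNamespace(
    WorkQueueManager=SimpleNamespace(
        queueParams={'ParentQueueCouchUrl': 'https://cmsweb.cern.ch/couchdb/workqueue'}
    )
)


def set_drain_mode(queue_params, enabled):
    """Return a copy of queue_params with 'DrainMode' added or removed."""
    params = dict(queue_params)
    if enabled:
        params['DrainMode'] = True
    else:
        # Taking the agent out of drain means dropping the key again.
        params.pop('DrainMode', None)
    return params


config.WorkQueueManager.queueParams = set_drain_mode(
    config.WorkQueueManager.queueParams, True
)
print(config.WorkQueueManager.queueParams)
```

Either way, the components listed above still have to be restarted after the file is saved.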

  2. Check every stuck-job condition (site not available, etc.) and apply the proper procedure. (Alan, could you add what needs to be done?)

  3. Check that all workflows are in a finished state (#7493, Ref):

    1. all workflows are completed in the agent (no subscriptions left)
    2. all blocks are closed
    3. all files are injected into PhEDEx
    4. all files are uploaded to DBS
  4. Report to WMStats (#7493).
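The four finished-state conditions above can be treated as a simple checklist. The sketch below is purely illustrative: `DrainStatus`, its field names and `blocking_conditions` are hypothetical, and the counts are assumed to be collected by the operator by whatever means are available. Nothing here queries the agent.

```python
from dataclasses import dataclass


@dataclass
class DrainStatus:
    """Counts gathered by the operator; all field names are illustrative."""
    open_subscriptions: int   # subscriptions not yet completed in the agent
    open_blocks: int          # blocks not yet closed
    files_not_in_phedex: int  # files not yet injected into PhEDEx
    files_not_in_dbs: int     # files not yet uploaded to DBS


def blocking_conditions(status):
    """Return the checks that still fail; an empty list means drained."""
    checks = {
        'workflows completed in the agent': status.open_subscriptions,
        'blocks closed': status.open_blocks,
        'files injected into PhEDEx': status.files_not_in_phedex,
        'files uploaded to DBS': status.files_not_in_dbs,
    }
    return [name for name, count in checks.items() if count > 0]


# Example with made-up numbers: two blocks are still open.
print(blocking_conditions(DrainStatus(0, 2, 0, 0)))
```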

Speedy draining - setting maxRetries to 0

Once an agent goes below 500 jobs in condor, the following settings can be applied so that the agent completes those jobs as soon as possible.

First, set maxRetries to 0 so that any failure is terminal and ACDC documents are created. In the agent configuration file config/wmagent/config.py, replace this line:

    config.ErrorHandler.maxRetries = {'default': 3, 'Merge': 4, 'Cleanup': 2, 'LogCollect': 1, 'Harvesting': 2}

with:

    config.ErrorHandler.maxRetries = 0

Now restart ErrorHandler and RetryManager:

    $manage execute-agent wmcoreD --restart --component ErrorHandler,RetryManager
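To see why a value of 0 makes every failure terminal, here is a minimal sketch of how a retry limit given either as a per-type dictionary or as a single integer could be resolved. This is not the ErrorHandler implementation. `retry_limit` and `is_terminal` are hypothetical names, and the comparison rule is an assumption made for illustration.

```python
def retry_limit(max_retries, job_type):
    """Resolve the retry limit for a job type.

    max_retries may be a single integer or a dict keyed by job type
    with a 'default' entry, mirroring the two config forms above.
    """
    if isinstance(max_retries, dict):
        return max_retries.get(job_type, max_retries['default'])
    return max_retries


def is_terminal(retry_count, limit):
    # Assumption: a failure is terminal once the retries already used
    # reach the limit, so a limit of 0 makes the first failure terminal.
    return retry_count >= limit


original = {'default': 3, 'Merge': 4, 'Cleanup': 2, 'LogCollect': 1, 'Harvesting': 2}

for setting in (original, 0):
    limit = retry_limit(setting, 'Merge')
    print(limit, is_terminal(0, limit))
```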

Speedy draining - enabling all sites

This should be applied only when there are very few jobs left to run, say fewer than 100 jobs in condor.

Before making these changes, we first need to disable the AgentStatusWatcher component; otherwise it will automatically set each site to its status in SSB (drain, normal, etc.) and set the thresholds accordingly. In the agent configuration file config/wmagent/config.py, replace this line:

    config.AgentStatusWatcher.enabled = True

with:

    config.AgentStatusWatcher.enabled = False

Then restart AgentStatusWatcher:

    $manage execute-agent wmcoreD --restart --component AgentStatusWatcher

Now we can manually update a couple of tables in the SQL database. Open a MariaDB/Oracle prompt with:

    $manage db-prompt wmagent

and execute the following update statements (they enable ALL sites and set the thresholds to a reasonable number):

    UPDATE wmbs_location SET state=(SELECT id from wmbs_location_state where name='Normal') WHERE state!=(SELECT id from wmbs_location_state where name='Normal');
    UPDATE wmbs_location SET running_slots=1000, pending_slots=1000;
    UPDATE rc_threshold SET max_slots=1000, pending_slots=1000;
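To preview what these statements do before touching the agent database, they can be tried against a throwaway copy. The sketch below runs the same three updates on an in-memory SQLite database. The toy schema contains only the columns named in the statements, and the site names, state names other than 'Normal', and initial values are invented for illustration. The real WMBS tables have more columns, and SQLite is only a stand-in for MariaDB/Oracle.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a toy version of the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE wmbs_location_state (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE wmbs_location (
            id INTEGER PRIMARY KEY, site_name TEXT, state INTEGER,
            running_slots INTEGER, pending_slots INTEGER);
        CREATE TABLE rc_threshold (
            site_id INTEGER, max_slots INTEGER, pending_slots INTEGER);
        INSERT INTO wmbs_location_state VALUES (1, 'Normal'), (2, 'Down'), (3, 'Draining');
        INSERT INTO wmbs_location VALUES
            (1, 'T1_EXAMPLE_A', 1, 200, 100),
            (2, 'T2_EXAMPLE_B', 2, 0, 0),
            (3, 'T2_EXAMPLE_C', 3, 50, 10);
        INSERT INTO rc_threshold VALUES (1, 200, 100), (2, 0, 0), (3, 50, 10);
    """)
    return conn


def enable_all_sites(conn, slots=1000):
    """Apply the three UPDATE statements from the text."""
    cur = conn.cursor()
    # Put every site that is not 'Normal' back into the 'Normal' state.
    cur.execute(
        "UPDATE wmbs_location "
        "SET state=(SELECT id FROM wmbs_location_state WHERE name='Normal') "
        "WHERE state!=(SELECT id FROM wmbs_location_state WHERE name='Normal')"
    )
    # Raise the per-site and per-threshold slot limits.
    cur.execute("UPDATE wmbs_location SET running_slots=?, pending_slots=?", (slots, slots))
    cur.execute("UPDATE rc_threshold SET max_slots=?, pending_slots=?", (slots, slots))
    conn.commit()


conn = build_toy_db()
enable_all_sites(conn)
for row in conn.execute("SELECT site_name, state, running_slots, pending_slots FROM wmbs_location"):
    print(row)
```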