Draining Steps
- Tell WorkQueueManager not to accept any more work blocks:
Open the config file at:
/data/srv/wmagent/current/config/wmagent/config.py
Look for the workqueue params:
config.WorkQueueManager.queueParams = {'ParentQueueCouchUrl': 'https://cmsweb.cern.ch/couchdb/workqueue'}
Add 'DrainMode': True to the params:
config.WorkQueueManager.queueParams = {'DrainMode': True, 'ParentQueueCouchUrl': 'https://cmsweb.cern.ch/couchdb/workqueue'}
Save the changes.
Restart WorkQueueManager, AnalyticsDataCollector and AgentStatusWatcher (the code could be changed to pick up the config change automatically, see #7994):
$manage execute-agent wmcoreD --restart --components=WorkQueueManager,AnalyticsDataCollector,AgentStatusWatcher
To take the agent out of drain, undo these changes.
- Check for stuck job conditions (site not available, etc.) and apply the proper procedure. (Alan, could you add what needs to be done?)
- Check that all workflows are in the finished state (#7493). Ref
  - All workflows are completed in the agent (no subscriptions left).
  - All blocks are closed.
  - All files are injected into PhEDEx.
  - All files are uploaded to DBS.
- Report to WMStats (#7493).
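The finished-state checks above can be summarised as a simple predicate. This is only a minimal sketch and not WMCore code: `DrainStatus`, `pending_items` and `is_fully_drained` are hypothetical names, and the four counts are assumed to be collected by the operator from the agent database, PhEDEx and DBS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrainStatus:
    """Hypothetical snapshot of an agent; the operator fills in the counts."""
    open_subscriptions: int    # subscriptions still left in the agent
    open_blocks: int           # blocks that are not closed yet
    files_not_in_phedex: int   # files not yet injected into PhEDEx
    files_not_in_dbs: int      # files not yet uploaded to DBS


def pending_items(status: DrainStatus) -> List[str]:
    """Return the names of the checks that still fail (empty means drained)."""
    checks = {
        "open_subscriptions": status.open_subscriptions,
        "open_blocks": status.open_blocks,
        "files_not_in_phedex": status.files_not_in_phedex,
        "files_not_in_dbs": status.files_not_in_dbs,
    }
    return [name for name, count in checks.items() if count > 0]


def is_fully_drained(status: DrainStatus) -> bool:
    """The agent counts as drained only when every check passes."""
    return not pending_items(status)
```

Calling `pending_items` rather than only `is_fully_drained` tells you which of the four conditions is still blocking the agent.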
Once an agent goes below 500 jobs in condor, I think the following settings can be applied such that the agent completes all those jobs as soon as possible.
First, set maxRetries to 0 so that any failure is terminal and ACDC documents are created. In the agent configuration file config/wmagent/config.py, replace this line:
config.ErrorHandler.maxRetries = {'default': 3, 'Merge': 4, 'Cleanup': 2, 'LogCollect': 1, 'Harvesting': 2}
by:
config.ErrorHandler.maxRetries = 0
Now restart ErrorHandler and RetryManager:
$manage execute-agent wmcoreD --restart --component ErrorHandler,RetryManager
This should only be applied when there are very few jobs left to run, say fewer than 100 jobs in condor.
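To illustrate why a maxRetries of 0 makes every failure terminal, here is a minimal sketch of a retry-limit lookup. It is an assumption about the general logic, not the actual ErrorHandler code: `retry_limit` and `is_terminal` are hypothetical helpers, and the exact comparison WMCore performs may differ.

```python
from typing import Dict, Union

# maxRetries may be a single integer or a per-job-type dictionary.
RetryConfig = Union[int, Dict[str, int]]


def retry_limit(max_retries: RetryConfig, job_type: str) -> int:
    """Return the retry limit for a job type (hypothetical helper)."""
    if isinstance(max_retries, dict):
        # Job types without their own entry fall back to 'default'.
        return max_retries.get(job_type, max_retries["default"])
    return max_retries


def is_terminal(max_retries: RetryConfig, job_type: str, retry_count: int) -> bool:
    """A failure is terminal once the job has used up its retries."""
    return retry_count >= retry_limit(max_retries, job_type)


per_type = {'default': 3, 'Merge': 4, 'Cleanup': 2, 'LogCollect': 1, 'Harvesting': 2}
print(is_terminal(per_type, 'Merge', 0))  # first failure with the original settings
print(is_terminal(0, 'Merge', 0))         # first failure with maxRetries = 0
```

Under this assumed logic, the first failure of any job type is already terminal once the limit is 0, which is what lets the agent finish quickly and hand the failures over to ACDC.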
To make the manual site and threshold changes described next, we first need to disable the AgentStatusWatcher component; otherwise it will automatically set sites to their respective status in SSB (drain, normal, etc.) and set the thresholds accordingly. In the agent configuration file config/wmagent/config.py, replace this line:
config.AgentStatusWatcher.enabled = True
by:
config.AgentStatusWatcher.enabled = False
Then restart AgentStatusWatcher:
$manage execute-agent wmcoreD --restart --component AgentStatusWatcher
Now we can manually update a couple of tables in the SQL database. Open a MariaDB/Oracle prompt with:
$manage db-prompt wmagent
and execute the following update statements (they enable ALL sites and set the thresholds to a reasonable number):
UPDATE wmbs_location SET state=(SELECT id FROM wmbs_location_state WHERE name='Normal') WHERE state!=(SELECT id FROM wmbs_location_state WHERE name='Normal');
UPDATE wmbs_location SET running_slots=1000, pending_slots=1000;
UPDATE rc_threshold SET max_slots=1000, pending_slots=1000;
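To see what these statements do before running them against a real agent, the sketch below replays them on a toy in-memory SQLite database. The schema is a heavily simplified assumption: only the columns named in the statements above come from this page, while `site_name`, `site_id`, the example site names and the helper functions are hypothetical. The real WMAgent tables have more columns and live in MariaDB or Oracle.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create a simplified, hypothetical copy of the three tables involved."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE wmbs_location_state (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE wmbs_location (
            id INTEGER PRIMARY KEY, site_name TEXT, state INTEGER,
            running_slots INTEGER, pending_slots INTEGER);
        CREATE TABLE rc_threshold (
            site_id INTEGER, max_slots INTEGER, pending_slots INTEGER);
        INSERT INTO wmbs_location_state VALUES (1, 'Normal'), (2, 'Drain'), (3, 'Down');
        INSERT INTO wmbs_location VALUES
            (1, 'T1_Example_A', 1, 500, 200),
            (2, 'T2_Example_B', 2, 0, 0),
            (3, 'T2_Example_C', 3, 0, 0);
        INSERT INTO rc_threshold VALUES (1, 300, 100), (2, 0, 0), (3, 0, 0);
    """)
    return conn


def enable_all_sites(conn: sqlite3.Connection) -> None:
    """Run the three UPDATE statements from this page."""
    conn.execute(
        "UPDATE wmbs_location "
        "SET state=(SELECT id FROM wmbs_location_state WHERE name='Normal') "
        "WHERE state!=(SELECT id FROM wmbs_location_state WHERE name='Normal')")
    conn.execute("UPDATE wmbs_location SET running_slots=1000, pending_slots=1000")
    conn.execute("UPDATE rc_threshold SET max_slots=1000, pending_slots=1000")
    conn.commit()


conn = build_toy_db()
enable_all_sites(conn)
for row in conn.execute(
        "SELECT site_name, state, running_slots, pending_slots "
        "FROM wmbs_location ORDER BY id"):
    print(row)
```

Note that the second and third statements have no WHERE clause, so they overwrite the slots and thresholds of every site and every threshold row, not only those of the sites whose state was changed.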