
Draining Steps

Alan Malta Rodrigues edited this page Jul 9, 2018 · 16 revisions

What's the impact of an agent in drain mode

When an agent is set to drain mode, this is what happens:

  • it will NOT pull any new work from the central WorkQueue (but it will still acquire local WorkQueue elements, WQE)
  • it maximizes the pending slots for all sites, as if it were the only agent connected to its team (normally, pending thresholds are distributed among the agents sharing the same team).
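The effect on pending thresholds can be illustrated with a small sketch. It assumes that a site's pending threshold is split evenly among the agents of a team, which this page does not spell out; the function name and the numbers are hypothetical.

```python
def pending_slots_for_agent(site_pending_slots, agents_in_team, draining):
    """Return the pending-slot threshold one agent would use for a site.

    Assumption: the site threshold is divided evenly among the agents of a
    team, and a draining agent behaves as if it were the only agent.
    """
    if agents_in_team < 1:
        raise ValueError("agents_in_team must be at least 1")
    effective_agents = 1 if draining else agents_in_team
    return site_pending_slots // effective_agents


# Hypothetical site with 1200 pending slots shared by a team of 4 agents.
print(pending_slots_for_agent(1200, 4, draining=False))  # 300
print(pending_slots_for_agent(1200, 4, draining=True))   # 1200
```

Under this assumption, a draining agent can keep more jobs pending per site than it could as one of several active agents, which helps it finish its remaining work sooner.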

How to drain an agent

  1. Mark the agent as in drain mode such that it doesn't pull any new work:

This means the agent won't pull any workqueue elements from the global workqueue. It will only process what's already in the local queue (LQ), maximize the site thresholds and try to finish those jobs as soon as possible.

The draining configuration has been moved out of the agent's config.py file to the central CouchDB, in the reqmgr auxiliary database. Thus one only needs a user proxy (or a service certificate) and can then run the following curl command (replacing vocms0999.cern.ch with the actual host you want to put in drain mode):

curl --cert $X509_USER_CERT --key $X509_USER_KEY -X PUT -H "Content-type: application/json" -d '{"UserDrainMode":true}' https://cmsweb.cern.ch/reqmgr2/data/wmagentconfig/vocms0999.cern.ch

To take the agent out of drain, run the same command with the value set to false instead of true, e.g.:

curl --cert $X509_USER_CERT --key $X509_USER_KEY -X PUT -H "Content-type: application/json" -d '{"UserDrainMode":false}' https://cmsweb.cern.ch/reqmgr2/data/wmagentconfig/vocms0999.cern.ch

  2. Check for stuck jobs (site not available, etc.) and apply the proper procedure. (Alan, could you add what needs to be done?)

  3. Check that all workflows are in a finished state. (#7493) Ref

    1. all the workflows are completed in the agent (no subscriptions left)
    2. all the blocks are closed
    3. all the files are injected into PhEDEx
    4. all the files are uploaded to DBS
  4. Report to WMStats (#7493)
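The drain toggle described at the start of this procedure can also be scripted. The following is a minimal sketch that mirrors the curl command using only the Python standard library. The function names are hypothetical, and it assumes that X509_USER_CERT and X509_USER_KEY are set and that the default trust store accepts the cmsweb server certificate.

```python
import json
import os
import ssl
import urllib.request

# Endpoint taken from the curl commands on this page.
CONFIG_URL = "https://cmsweb.cern.ch/reqmgr2/data/wmagentconfig/{host}"


def build_drain_request(host, drain):
    """Build the PUT request that sets UserDrainMode for one agent."""
    body = json.dumps({"UserDrainMode": bool(drain)}).encode("utf-8")
    return urllib.request.Request(
        CONFIG_URL.format(host=host),
        data=body,
        method="PUT",
        headers={"Content-type": "application/json"},
    )


def send(request):
    """Send the request, authenticating with the X509 certificate and key."""
    context = ssl.create_default_context()
    context.load_cert_chain(os.environ["X509_USER_CERT"],
                            os.environ["X509_USER_KEY"])
    with urllib.request.urlopen(request, context=context) as response:
        return response.status


if __name__ == "__main__":
    # vocms0999.cern.ch is the placeholder host used on this page.
    req = build_drain_request("vocms0999.cern.ch", drain=True)
    print(req.get_method(), req.full_url, req.data.decode())
    # send(req)  # uncomment on a machine with a valid certificate
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and payload before anything is sent to the production service.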

Speedy draining

Setting maxRetries to 0

Some component functionality was moved to the reqmgr_auxiliary central DB, so the new way to set the maximum number of job retries to 0 is to update the specific agent document in CouchDB. Alternatively, run the following curl command to set it to 0 (remember to replace the agent name in the URL):

curl --cert $X509_USER_CERT --key $X509_USER_KEY -X PUT -H "Content-type: application/json" -d '{"MaxRetries":0}' https://cmsweb.cern.ch/reqmgr2/data/wmagentconfig/vocms0XXX.cern.ch

Setting maxRetries to 0 (legacy agents)

Once an agent goes below 500 jobs in condor, I think the following settings can be applied so that the agent completes all those jobs as soon as possible.

First, set maxRetries to 0 so that any failure is terminal and ACDC documents are created. In the agent configuration file config/wmagent/config.py, replace this line:

config.ErrorHandler.maxRetries = {'default': 3, 'Merge': 4, 'Cleanup': 2, 'LogCollect': 1, 'Harvesting': 2}

by:

config.ErrorHandler.maxRetries = 0

Now restart ErrorHandler and RetryManager:

$manage execute-agent wmcoreD --restart --component ErrorHandler,RetryManager
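To see why a scalar 0 makes every failure terminal, here is a minimal sketch of a per-job-type retry lookup. It is not the ErrorHandler code. It assumes that a scalar maxRetries is applied as the default for all job types and that a job is retried only while its retry count is below the limit; both function names are hypothetical.

```python
def normalize_max_retries(max_retries):
    """Turn a scalar limit into a {'default': limit} mapping (assumed behaviour)."""
    if not isinstance(max_retries, dict):
        return {"default": max_retries}
    if "default" not in max_retries:
        raise ValueError("maxRetries needs a 'default' entry")
    return dict(max_retries)


def can_retry(job_type, retry_count, max_retries):
    """Return True if a failed job of this type may be retried once more."""
    limits = normalize_max_retries(max_retries)
    return retry_count < limits.get(job_type, limits["default"])


# Per-job-type limits as in the original configuration line.
original = {'default': 3, 'Merge': 4, 'Cleanup': 2, 'LogCollect': 1, 'Harvesting': 2}

print(can_retry("Merge", 3, original))       # True: Merge allows 4 retries
print(can_retry("Processing", 3, original))  # False: falls back to default of 3
print(can_retry("Merge", 0, 0))              # False: with 0, the first failure is terminal
```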

Enabling all sites

This should be applied only when there are very few jobs left to run, let's say fewer than 100 jobs in condor.

In order to make these changes, we first need to disable the AgentStatusWatcher component, otherwise it will automatically set sites to their respective status in SSB (drain, normal, etc.) and set the thresholds accordingly. In the agent configuration file config/wmagent/config.py, replace this line:

config.AgentStatusWatcher.enabled = True

by:

config.AgentStatusWatcher.enabled = False

Then restart AgentStatusWatcher:

$manage execute-agent wmcoreD --restart --component AgentStatusWatcher

Now we can manually update a couple of tables in the SQL database. Open the MariaDB/Oracle prompt with:

$manage db-prompt wmagent

and execute the following update statements (they enable ALL sites and set the thresholds to a reasonable number):

    UPDATE wmbs_location SET state=(SELECT id FROM wmbs_location_state WHERE name='Normal') WHERE state!=(SELECT id FROM wmbs_location_state WHERE name='Normal');
    UPDATE wmbs_location SET running_slots=1000, pending_slots=1000;
    UPDATE rc_threshold SET max_slots=1000, pending_slots=1000;
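Before touching the agent database, the effect of these statements can be tried on a throwaway copy. The sketch below runs them against an in-memory SQLite database. The table and column names used in the UPDATE statements come from this page, but the rest of the schema, the site names and the non-Normal state names are simplified assumptions and do not reflect the real WMBS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, hypothetical schema with just enough columns for the updates.
conn.executescript("""
CREATE TABLE wmbs_location_state (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_location (id INTEGER PRIMARY KEY, site_name TEXT, state INTEGER,
                            running_slots INTEGER, pending_slots INTEGER);
CREATE TABLE rc_threshold (site_id INTEGER, max_slots INTEGER, pending_slots INTEGER);
INSERT INTO wmbs_location_state VALUES (1, 'Normal'), (2, 'Down'), (3, 'Draining');
INSERT INTO wmbs_location VALUES (1, 'T2_XX_SiteA', 1, 500, 200),
                                 (2, 'T2_XX_SiteB', 3, 0, 0);
INSERT INTO rc_threshold VALUES (1, 300, 100), (2, 0, 0);
""")

# The three update statements from this page.
conn.executescript("""
UPDATE wmbs_location SET state=(SELECT id FROM wmbs_location_state WHERE name='Normal')
    WHERE state!=(SELECT id FROM wmbs_location_state WHERE name='Normal');
UPDATE wmbs_location SET running_slots=1000, pending_slots=1000;
UPDATE rc_threshold SET max_slots=1000, pending_slots=1000;
""")

locations = conn.execute(
    "SELECT site_name, state, running_slots, pending_slots "
    "FROM wmbs_location ORDER BY id").fetchall()
thresholds = conn.execute(
    "SELECT site_id, max_slots, pending_slots "
    "FROM rc_threshold ORDER BY site_id").fetchall()

print(locations)   # every site now has state 1 ('Normal') and 1000/1000 slots
print(thresholds)  # every threshold row now has 1000/1000 slots
```

Note that the last two statements have no WHERE clause, so they overwrite the slots for every site and every threshold row, not only for the sites that were re-enabled.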

Rolling a new WMAgent version to production

This section contains a short description of how we push a new WMAgent release to production, which is different from simply upgrading an agent.

First, some background information which is useful for a better understanding of this procedure.

  • a new WMAgent stable version is released every 2 months (sometimes 2.5 months);
  • a new version of the central services is released and deployed in CMSWEB every month;
  • a new WMAgent release candidate is made available right after the CMSWEB production upgrade (within 2 days);
  • then the validation process starts, and it takes on average a week to validate and fix any issues before we cut the final stable release;
  • from this point on, we have another branch to maintain, making sure that important fixes are backported to it (in addition to the master branch).

Given the background information above, we can have two different WMAgent upgrade scenarios:

  1. there are (severe) breaking changes in the agents (usually related to reqmgr2 and/or workqueue, or to the database schema) and we can't have a mix of agent versions pulling work from WorkQueue. Thus all the old agents have to be put into drain at the same time, while agents with the new release are brought up and running.
  2. the most common scenario, where there are no breaking changes and both WMAgent versions can coexist and pull work from the global workqueue.

No matter the scenario, the first goal is to have the old release replaced by the new one ASAP, so that we don't need to maintain 3 different branches, investigating and backporting issues from different releases.

In addition to that, we shouldn't have WMAgents falling (too far) behind the new developments made in ReqMgr2 and WorkQueue. Ideally, ALL old releases should be at least in drain mode by the next CMSWEB upgrade date, and better still off the grid entirely. To expand on that: there is a new CMSWEB release every ~4 weeks and validation/bugfixing takes about a week, which gives us 3 weeks to get all agents rotated and leaves 5 weeks to run a pool of new WMAgent releases (until the next upgrade starts).
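The week counts above can be reproduced with a few lines of arithmetic. This is one reading of the numbers, assuming a 4-week CMSWEB cycle, an 8-week WMAgent release cycle (the "every 2 months" from the background list) and a 1-week validation; the variable names are illustrative only.

```python
CMSWEB_CYCLE_WEEKS = 4    # new CMSWEB release every ~4 weeks
WMAGENT_CYCLE_WEEKS = 8   # new WMAgent stable release every ~2 months
VALIDATION_WEEKS = 1      # validation and bug fixing of the release candidate

# Time between the stable release being cut and the next CMSWEB upgrade.
rotation_weeks = CMSWEB_CYCLE_WEEKS - VALIDATION_WEEKS

# Remainder of the WMAgent cycle once all agents have been rotated.
steady_weeks = WMAGENT_CYCLE_WEEKS - rotation_weeks

print(rotation_weeks, steady_weeks)  # 3 5
```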
