To update the codebase on the vocms0241 virtual machine:
- Push your commits or open your PRs against https://github.com/dmwm/cms-htcondor-es/tree/master.
- On the virtual machine, fetch: `git fetch --all`
- Check the difference between local and remote to be sure: `git diff origin/master master`
- Before rebasing, make sure the 12-minute period has finished (see the readiness sketch after this list):
  - There is no running `spider_cms` process: `ps aux | grep "python spider_cms.py"`
  - The tail of the log (`tail -f log/spider_cms.log`) ends with a line like `@@@ Total processing time: ...`
- Finally, rebase: `git pull --rebase origin master`
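A small sketch of the pre-rebase readiness check described above; the process pattern and log path are the ones used in the commands quoted in the list.

```bash
# Sketch: confirm the 12-minute cycle has finished before rebasing.
if pgrep -f "python spider_cms.py" > /dev/null; then
    echo "spider_cms.py is still running; wait for the current cycle to finish"
else
    # the last cycle should have ended with an "@@@ Total processing time:" line
    tail -n 5 log/spider_cms.log
fi
```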
Spider documentation: https://cmsmonit-docs.web.cern.ch/cms-htcondor-es/spider/
- Cron jobs; the main and test scripts live in `cms-htcondor-es/scripts`. Read all details and parameters of the scripts that are used.
- Filebeat and Logstash service; complete documentation: `cms-htcondor-es/service-logstash`
*/12 * * * * /home/cmsjobmon/cms-htcondor-es/scripts/spider_cms_queues.sh
5-59/12 * * * * /home/cmsjobmon/cms-htcondor-es/scripts/spider_cms_history.sh
0 3 * * * /bin/bash "/home/cmsjobmon/cms-htcondor-es/scripts/cronAffiliation.sh"
# 0 * * * * /bin/bash "/home/cmsjobmon/scripts/sourcesCompare_cron.sh"
The `cmsjobmon` user is used to run the cron jobs. The `logstash` user is used to run the Logstash service. A quick check of this setup is sketched below.
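A minimal sketch for inspecting the setup above; it assumes sudo access on the VM and that Logstash is managed by systemd under the unit name `logstash` (an assumption, adjust to the actual service name).

```bash
# Sketch: inspect the cron setup and the Logstash service owner.
sudo crontab -u cmsjobmon -l   # should list the spider_cms_queues/history and affiliation entries
systemctl status logstash      # confirm the service is active and runs as the logstash user
```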
- The current virtual machine has Python 3.9; before deployment a new virtual environment named `venv3_9` should be created, and the requirements should be installed before the cron jobs are activated.
- To make migrations easy, please use meaningful venv names (a verification sketch follows the command below):
cd /home/cmsjobmon/cms-htcondor-es && python3 -m venv venv3_X && ./venv3_X/bin/pip install --no-cache-dir -r requirements.txt
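A short sketch for sanity-checking the fresh venv before the cron scripts are pointed at it; the `venv3_9` name follows the convention above.

```bash
# Sketch: verify the new virtual environment before activating the cron jobs.
cd /home/cmsjobmon/cms-htcondor-es
./venv3_9/bin/python --version   # expect Python 3.9.x
./venv3_9/bin/pip check          # report broken or missing dependencies
./venv3_9/bin/pip list | head    # quick look at what requirements.txt installed
```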
Required CERN CA certificate: /etc/pki/tls/certs/CERN-bundle.pem
Environment variables used in the scripts:
export SPIDER_WORKDIR="/home/cmsjobmon/cms-htcondor-es"
export AFFILIATION_DIR_LOCATION="$SPIDER_WORKDIR/.affiliation_dir.json"
| file | path | description |
|---|---|---|
| SPIDER_WORKDIR | `$HOME/cms-htcondor-es` | MAIN |
| usercert.pem | `$HOME/.globus/usercert.pem` | used by Affiliation cron job to query CRIC and update affiliations |
| userkey.pem | `$HOME/.globus/userkey.pem` | used by Affiliation cron job to query CRIC and update affiliations |
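Since the Affiliation cron job depends on the Grid certificate listed above, a quick expiry check before deployment can avoid a silent failure; this is only a sketch using standard openssl.

```bash
# Sketch: check that the certificate used for CRIC queries exists and has not expired.
openssl x509 -in "$HOME/.globus/usercert.pem" -noout -subject -enddate
ls -l "$HOME/.globus/userkey.pem"   # the private key should be readable only by its owner
```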
| file | parametrized value | description |
|---|---|---|
| amq_username | `$SPIDER_WORKDIR/etc/amq_username` | ActiveMQ user |
| amq_password | `$SPIDER_WORKDIR/etc/amq_password` | ActiveMQ password |
| .affiliation_dir.json | `$SPIDER_WORKDIR/.affiliation_dir.json` | stores affiliations of all users in JSON format |
| checkpoint.json | `$SPIDER_WORKDIR/checkpoint.json` | last update times of each history schedd |
| collectors.json | `$SPIDER_WORKDIR/etc/collectors.json` | collectors that will be queried |
| es_conf.json | `$SPIDER_WORKDIR/etc/es_conf.json` | CMS os-cms OpenSearch host, username, password credentials |
| JobMonitoring.json | `$SPIDER_WORKDIR/JobMonitoring.json` | ClassAds to JSON format conversion schema |
| last_mappings.json | `$SPIDER_WORKDIR/last_mappings.json` | CMS os-cms OpenSearch index mapping |
| log, log_history, log_aff | `$SPIDER_WORKDIR/log*` | log directories |
| venv3_9 | `$SPIDER_WORKDIR/venv3_9` | used venv |
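A hedged sketch that checks the files and directories from the table above are in place before the cron jobs are (re)activated; the paths follow the table, and any missing entry should be investigated first.

```bash
# Sketch: sanity-check the required files and directories listed in the table above.
export SPIDER_WORKDIR="/home/cmsjobmon/cms-htcondor-es"
for f in etc/amq_username etc/amq_password etc/collectors.json etc/es_conf.json \
         JobMonitoring.json last_mappings.json checkpoint.json .affiliation_dir.json; do
    [ -e "$SPIDER_WORKDIR/$f" ] || echo "MISSING: $SPIDER_WORKDIR/$f"
done
for d in log log_history log_aff venv3_9; do
    [ -d "$SPIDER_WORKDIR/$d" ] || echo "MISSING DIR: $SPIDER_WORKDIR/$d"
done
```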
CONVENTIONS:
- A host name ending with `/os` refers to OpenSearch; otherwise it is treated as ElasticSearch and a port must be provided.
- No port needs to be provided for OpenSearch.
- Multiple clusters are supported; data is sent to all of them without querying the schedds again.
- Do not add the `https://` prefix.
[
{
"host": "es-cmsX(ElasticSearch).cern.ch",
"port": 9203,
"username": "user",
"password": "pass"
},
{
"host": "os-cmsY(OpenSearch).cern.ch/os",
"username": "user",
"password": "pass"
}
]
Example scenario to send data to 2 OpenSearch instances:
[
{"host": "os-cms1.cern.ch/os", "username": "user", "password": "pass"},
{"host": "os-cms2.cern.ch/os", "username": "user", "password": "pass"}
]
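Before restarting the cron jobs it may be worth validating the edited configuration files; a minimal sketch using only the Python standard library:

```bash
# Sketch: confirm the edited JSON configuration files still parse.
export SPIDER_WORKDIR="/home/cmsjobmon/cms-htcondor-es"
python3 -m json.tool "$SPIDER_WORKDIR/etc/es_conf.json" > /dev/null && echo "es_conf.json OK"
python3 -m json.tool "$SPIDER_WORKDIR/etc/collectors.json" > /dev/null && echo "collectors.json OK"
```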
- Use the Dockerfile, which contains everything except the secrets. You need to provide the secrets listed under "Required directories, files and secrets".
- You can use the test cron scripts instead of the production equivalents (`spider_cms_history.sh` and `spider_cms_queues.sh`) to test the whole pipeline. They run in `read_only` mode by default. You can use the test AMQ producer and a test OpenSearch index. Everything you need for a test:
  - To debug, use `tests/debugcli.py`.
- Prepare all the production requirements: directory, files, secrets and venv.
- Take a backup of `$SPIDER_WORKDIR/checkpoint.json` as `$SPIDER_WORKDIR/checkpoint.json.back` (see the sketch after this list).
- Set the new crontab and let the jobs start.
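A short sketch of the backup and crontab switch; the paths match the table above, and `new_crontab.txt` is a hypothetical file holding the entries shown at the top of this page.

```bash
# Sketch: back up the history checkpoint and switch the crontab for the cmsjobmon user.
cd /home/cmsjobmon/cms-htcondor-es
cp -p checkpoint.json checkpoint.json.back
crontab -l > crontab.old   # keep a copy of the current entries
crontab new_crontab.txt    # hypothetical file with the spider/affiliation entries shown above
```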
- For the Logstash and Filebeat migration, please follow the documentation in `cms-htcondor-es/service-logstash`.
Scenario-1: Data sent wrongly
- Roll back to the previous commit, directory/file structure and venv.
- Roll back to the backed-up checkpoint: `mv checkpoint.json.back checkpoint.json`
- Delete documents from the os-cms.cern.ch `cms-*` indices with `metadata.spider_git_hash='problematic_hash'` (a delete-by-query sketch follows this list).
- Ask MONIT to delete documents from the monit-opensearch.cern.ch `monit_prod_condor_raw_metric*` indices with `data.metadata.spider_git_hash='problematic_hash'`.
- Inform HDFS users and announce that they should filter out `metadata.spider_git_hash='problematic_hash'` in their Spark jobs.
- HDFS data deletion is not easy because of the Flume data; it can only be deleted a couple of days later. Discuss with MONIT.
- Data loss is inevitable (12 minutes or more) for Condor Job Monitoring schedds Queue data (Running, Held, Idle, etc.).
- There is no data loss for schedds History data (Completed, etc.) thanks to checkpoint.json, provided the required actions are taken immediately.
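A hedged sketch of how the os-cms deletion could be issued with the OpenSearch delete-by-query API; the index pattern, field name and credential style come from this page, but verify that `metadata.spider_git_hash` is queryable as an exact term and test on a single index first.

```bash
# Sketch: remove documents written by the problematic deployment from the cms-* indices.
curl -s -XPOST --negotiate -u $user:$pass \
  "https://os-cms1.cern.ch/os/cms-*/_delete_by_query" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"metadata.spider_git_hash": "problematic_hash"}}}'
```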
Scenario-2: Did not work at all
- Roll back to the previous commit, directory/file structure and venv (see the rollback sketch after this list).
- Roll back to the backed-up checkpoint: `mv checkpoint.json.back checkpoint.json`
- Data loss is inevitable (12 minutes or more) for Condor Job Monitoring schedds Queue data (Running, Held, Idle, etc.).
- There is no data loss for schedds History data (Completed, etc.) thanks to checkpoint.json, provided the required actions are taken immediately.
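A minimal rollback sketch for Scenarios 1 and 2, assuming the previous good commit hash (shown here as a placeholder) is known and the old venv is still in place.

```bash
# Sketch: return the working directory to the last known-good state.
cd /home/cmsjobmon/cms-htcondor-es
git checkout <previous_good_commit>       # placeholder for the commit that was running before the update
mv checkpoint.json.back checkpoint.json   # restore the pre-deployment history checkpoints
# re-point the crontab at the old venv/scripts if they were changed
```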
Scenario-3: Failed after some time
In any case, data loss is inevitable for Condor Job Monitoring schedds Queue data (Running, Held, Idle, etc.).
- If the problem is noticed within 12 hours, no action is needed. Just fix the problem and keep it running.
- If the problem is noticed later than 12 hours, set the `--history_query_max_n_minutes=12*60` parameter in `spider_cms_history.sh` to a time window wide enough to include the last checkpoints in `checkpoint.json`, so that the schedds History results are backfilled. Otherwise, there will be data loss for the schedds History data. This is due to a safety measure: even if a schedd's previous checkpoint is older than `history_query_max_n_minutes`, its checkpoint will be clamped to this maximum time window. It can be tweaked in disaster scenarios (see the example below).
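A hedged worked example of the tweak above: if the spider was down for roughly 20 hours, the history window must cover at least the downtime plus some margin. The flag name is the one quoted above; the chosen value is an assumption, and the way it is passed inside `spider_cms_history.sh` is not shown here.

```bash
# Example: ~20 h of downtime, so the default 12-hour window would drop History data.
# Default quoted above:  --history_query_max_n_minutes=$((12*60))   # 720 minutes
# Disaster tweak (assumed: downtime + ~1 h margin):
#   --history_query_max_n_minutes=$((21*60))                        # 1260 minutes
echo $((21*60))   # 1260
```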
# Get indices (OpenSearch, os-cms1.cern.ch)
curl -s -XGET --negotiate -u $user:$pass https://os-cms1.cern.ch/os/_cat/indices | grep ceyhun | sort
# Delete one index
curl -s -XDELETE --negotiate -u $user:$pass https://os-cms1.cern.ch/os/cms-test-ceyhun-2023-02-21
# Deletion command (NOT RUN, just echoed) for multiple indices matching a pattern
echo 'curl -XDELETE --negotiate -u $user:$pass' https://os-cms1.cern.ch/os/"$(
curl -s -XGET --negotiate -u $user:$pass https://os-cms1.cern.ch/os/_cat/indices/cms-test-ceyhun* | sort | awk '{ORS=","; print $3}'
)"
# Get indices (ElasticSearch, es-cms.cern.ch)
curl -s -XGET -u $user:$pass https://es-cms.cern.ch:9203/_cat/indices/ | grep ceyhun | sort
# Delete one index
curl -s -XDELETE -u $user:$pass https://es-cms.cern.ch:9203/cms-test-ceyhun-2023-02-21
# Deletion command (NOT RUN, just echoed) for multiple indices matching a pattern
echo 'curl -XDELETE -u $user:$pass' https://es-cms.cern.ch:9203/"$(
curl -s -XGET -u $user:$pass https://es-cms.cern.ch:9203/_cat/indices/cms-test-ceyhun* | sort | awk '{ORS=","; print $3}')"