GOALS for 2025 #8879

Open
1 of 20 tasks
belforte opened this issue Jan 14, 2025 · 0 comments

belforte commented Jan 14, 2025

Picking up from #7876

goals we expect we will achieve:

  • canary deployment for REST
  • switch to HTCondor v2 API
  • reboot each schedd every 6 months or so (automatically). We can do this via a crontab (in puppet) which stops submissions to the Vanilla pool, holds all jobs, waits and reboots. We need to synchronize with TW, which puts the schedd in drain the day before. Proposal: have a hiera variable with the list of reboot dates, placed in the TW config and parsed by TW code, and also placed on the scheduler disk and parsed by the crontab. See also schedulers - implement automatic reboot #8651
  • use of tokens for talking to CMSWEB: Use tokens for crab for backend service-to-service communication #7845 (predicated on CMS TokenTeam progress)
  • refactor/re-evaluate crab resubmit: refactor crab resubmit #6270
  • revisit JobRetry.py policies (more retries, re-eval site list, add cooloff ...)
  • ArgoCD for our K8s applications ?
  • more monitoring via Spark ? Make it possible to build panels "as needed" in Grafana ?
  • a job status table for each task (avoid accessing status cache on schedd) (and modify crab status)
  • DAG status in taskDB (running/completed/failed) (and modify crab status)
  • use submitted task info in task scheduling
  • clean up source code (remove "transients" from past migrations)
  • can we "evolve" crab status command to avoid the need to retrieve files from scheduler's WEB_DIR ? Maybe not useful ?
  • clean up and update documentation
  • Utilities for users to deal with data management via Rucio (e.g. list/delete rules)
  • full monitoring for preprod ?
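
For the automatic schedd reboot item above, here is a minimal sketch of how one shared list of reboot dates could drive both sides: TW draining the day before and the crontab rebooting on the day. Everything in it is an assumption made for illustration, not existing CRAB code: the file format (one ISO date per line, `#` for comments) and the names `parse_reboot_dates` and `action_for` are hypothetical.

```python
from datetime import date, timedelta


def parse_reboot_dates(text):
    """Parse one ISO date (YYYY-MM-DD) per line; skip blank lines and '#' comments.

    The file format is an assumption, not something defined in the proposal.
    """
    dates = []
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if line:
            dates.append(date.fromisoformat(line))
    return sorted(dates)


def action_for(today, reboot_dates):
    """Return 'reboot' on a reboot date, 'drain' the day before, otherwise 'none'."""
    if today in reboot_dates:
        return 'reboot'
    if today + timedelta(days=1) in reboot_dates:
        return 'drain'
    return 'none'


if __name__ == '__main__':
    # Hypothetical dates, only to exercise the logic.
    sample = "2025-03-04  # spring reboot\n2025-09-02\n"
    schedule = parse_reboot_dates(sample)
    print(action_for(date(2025, 3, 3), schedule))  # day before a reboot date
    print(action_for(date(2025, 3, 4), schedule))  # reboot date itself
```

With something like this, TW could stop sending tasks to a schedd when the answer is `drain`, and the crontab script on the schedd could go ahead with hold, wait and reboot only when the answer is `reboot`. Both sides would read the same list, so they cannot disagree on the date.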

things that we may realistically do, but it is not clear if/when we will want to do
