
[brainstorming] Improve job accounting #2131

Open
tcompa opened this issue Dec 10, 2024 · 1 comment
tcompa commented Dec 10, 2024

(with @mfranzon)

High-level goal: we should gather more detailed information about "how much" user X used Fractal in a given time window.

For each proxy of this quantity, we should define how/where the information is stored and how/where it is exposed to a Fractal admin or to a standard user. Note that it is not a given that storage of and access to this information must go through fractal-server; we will have to evaluate this later (e.g. if we need to rely heavily on the SLURM database, we won't copy all that information over into the Fractal DB).

Side comment: every piece of information should come with a timestamp and with a reference to a Fractal user.

Fractal-based proxies

The following metrics are already directly or indirectly present in the db:

  1. Number of Fractal jobs.
  2. Number of Fractal tasks for each submitted Fractal job.
  3. Elapsed time for each Fractal job (through timestamps).
  4. Elapsed time for each Fractal task (through logs).
  5. Number of projects, datasets, and workflows; number of images per dataset; number of tasks per workflow. These figures are of limited use, since they describe the current state, which may differ from the state when the jobs actually ran.
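As a sketch of how metric 3 could be aggregated per user over a time window, assuming an in-memory view of the job rows (the field names below are illustrative, not the actual fractal-server schema):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical job rows; in fractal-server these would come from the jobs
# table (names here are illustrative, not the real schema).
jobs = [
    {"user": "alice", "start": datetime(2024, 12, 10, 9, 0), "end": datetime(2024, 12, 10, 9, 45)},
    {"user": "alice", "start": datetime(2024, 12, 10, 10, 0), "end": datetime(2024, 12, 10, 10, 30)},
    {"user": "bob", "start": datetime(2024, 12, 10, 11, 0), "end": datetime(2024, 12, 10, 12, 0)},
]

def elapsed_per_user(jobs, window_start, window_end):
    """Sum elapsed wall-clock time per user, for jobs started in the window."""
    totals = defaultdict(timedelta)
    for job in jobs:
        if window_start <= job["start"] < window_end:
            totals[job["user"]] += job["end"] - job["start"]
    return dict(totals)

totals = elapsed_per_user(jobs, datetime(2024, 12, 10), datetime(2024, 12, 11))
# alice: 1:15:00, bob: 1:00:00
```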

Here is some additional information that we could gather within Fractal:

  1. For each Fractal task, we can count the number of processed images. Preliminary questions:
    • Is the count homogeneous for non-parallel/parallel/compound tasks?
    • How many images are counted for a compound task? Examples: (1) MIP task going from N images to N different images, (2) time-slice or channel parallelization.
  2. Information about SLURM job IDs:
    • A list of SLURM job IDs corresponding to each Fractal task.
    • The SLURM job ID corresponding to each single component (that is, OME-Zarr image) of each Fractal task.
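A minimal sketch of what such per-task SLURM-ID bookkeeping could look like, down to the single component (OME-Zarr image). The structure and all names here are hypothetical, not an existing Fractal data model; the image count in point 1 then falls out of the per-component mapping:

```python
# Hypothetical bookkeeping linking one Fractal task to its SLURM jobs.
task_accounting = {
    "cellpose_segmentation": {
        "slurm_job_ids": [101, 102],   # all SLURM jobs for this task
        "per_component": {             # SLURM job ID per OME-Zarr image
            "plate.zarr/A/01/0": 101,
            "plate.zarr/A/02/0": 102,
        },
    },
}

def processed_image_count(task_record) -> int:
    """Processed images for a task = number of per-component entries."""
    return len(task_record["per_component"])
```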

SLURM-based proxies

If we have easy access (from within Fractal) to information like "give me all SLURM job IDs for user X in the time window from A to B", then we can prepare a tool which queries the SLURM DB (from outside Fractal) and processes all relevant information to produce a tabular output. In this scenario, we have access to a huge amount of SLURM detail, and we will expose the parts that matter.
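A minimal sketch of such a query. The `sacct` flags are real (`-u` user, `-S`/`-E` time window, `-X` allocations only, `-n` no header, `-P` pipe-delimited output); the helper names are ours, and the tabular post-processing is only hinted at:

```python
import subprocess

def sacct_job_ids(user: str, start: str, end: str) -> list[str]:
    """All SLURM job IDs for `user` between `start` and `end` (sketch)."""
    cmd = [
        "sacct", "-u", user, "-S", start, "-E", end,
        "-X", "-n", "-P", "--format=JobID",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_job_ids(result.stdout)

def parse_job_ids(output: str) -> list[str]:
    """Parse the pipe-delimited, header-less sacct output into job IDs."""
    return [line.strip() for line in output.strip().splitlines() if line.strip()]

# On a cluster one would call, e.g.:
# job_ids = sacct_job_ids("alice", "2024-12-01", "2024-12-10")
```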

In the future, one can also push for a deeper integration of this external tool into fractal, so that e.g. it is exposed from within the admin area (as if it were a task collection operation).

A note

To check: is SLURM db (the one accessed through sacct) maintained long-term? I guess so, but we should double check.

jluethi commented Dec 11, 2024

Fractal server-side monitoring:

  • Keeping a record of jobs run and the number of tasks executed (preferably counted per image, as in: running a workflow with 2 tasks on 10 images => 20 tasks were run)
  • Keep track of the number of images added to image lists: both the current number of images in datasets and a cumulative "total images ever created" counter
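A sketch of the two counters from the last bullet (hypothetical; fractal-server keeps no such counter today). The current count can shrink when images are removed, while the cumulative counter only ever grows:

```python
class ImageAccounting:
    """Hypothetical per-dataset image counters (illustrative names)."""

    def __init__(self):
        self.current = 0      # images currently in the dataset's image list
        self.total_ever = 0   # "total images ever created" counter

    def add_images(self, n: int) -> None:
        self.current += n
        self.total_ever += n

    def remove_images(self, n: int) -> None:
        self.current -= n     # total_ever is intentionally untouched

acc = ImageAccounting()
acc.add_images(10)
acc.remove_images(4)
# acc.current == 6, acc.total_ever == 10
```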

Slurm-based accounting:

  • Number of TBs processed: not sure how we'd get there. I/O measured? Bytes written?
  • Processing times measured via SLURM-based proxies => CPU hours, GPU hours, (memory hours?)
  • +1 on doing SLURM-based accounting outside of Fractal
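For the CPU-hours proxy, `sacct` already exposes the needed per-job fields (e.g. via `--format=ElapsedRaw,AllocCPUS`, and `CPUTimeRAW` equals elapsed seconds × allocated cores). A minimal sketch of the arithmetic, with our own helper name:

```python
def cpu_hours(elapsed_seconds: int, alloc_cpus: int) -> float:
    """CPU hours for one job: elapsed wall time times allocated cores."""
    return elapsed_seconds * alloc_cpus / 3600

# A one-hour job on 8 cores accounts for 8.0 CPU hours.
total = cpu_hours(3600, 8)
```

GPU hours would follow the same pattern, reading the GPU allocation from the job's TRES fields instead of `AllocCPUS`.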
