
[brainstorming] Improve job accounting #2131

Open
tcompa opened this issue Dec 10, 2024 · 1 comment
tcompa commented Dec 10, 2024

(with @mfranzon)

High-level goal: we should gather more detailed information about "how much" user X used Fractal in a given time window.

For each proxy of this quantity, we should define how/where the information is stored and how/where it is exposed to a Fractal admin or to a standard user. Note that it is not a given that storage of and access to this information must go through fractal-server; we will have to evaluate this later (e.g. if we need to rely heavily on the SLURM database, we won't copy all that information over into the Fractal DB).

Side comment: every piece of information should come with a timestamp and with a reference to a Fractal user.

Fractal-based proxies

The following metrics are already directly or indirectly present in the db:

  1. Number of Fractal jobs.
  2. Number of Fractal tasks for each submitted Fractal job.
  3. Elapsed time for each Fractal job (through timestamps).
  4. Elapsed time for each Fractal task (through logs).
  5. Number of projects, datasets, and workflows; number of images per dataset; number of tasks per workflow. These figures are of limited use, since they describe the current state, which may differ from the state when the jobs actually ran.
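As a sketch of how metric 3 could be aggregated per user over a time window, assuming an in-memory view of the job rows (the field names below are illustrative, not the actual fractal-server schema):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical job rows; in fractal-server these would come from the jobs
# table (names here are illustrative, not the real schema).
jobs = [
    {"user": "alice", "start": datetime(2024, 12, 10, 9, 0), "end": datetime(2024, 12, 10, 9, 45)},
    {"user": "alice", "start": datetime(2024, 12, 10, 10, 0), "end": datetime(2024, 12, 10, 10, 30)},
    {"user": "bob", "start": datetime(2024, 12, 10, 11, 0), "end": datetime(2024, 12, 10, 12, 0)},
]

def elapsed_per_user(jobs, window_start, window_end):
    """Sum elapsed wall-clock time per user, for jobs started in the window."""
    totals = defaultdict(timedelta)
    for job in jobs:
        if window_start <= job["start"] < window_end:
            totals[job["user"]] += job["end"] - job["start"]
    return dict(totals)

totals = elapsed_per_user(jobs, datetime(2024, 12, 10), datetime(2024, 12, 11))
# alice: 1:15:00, bob: 1:00:00
```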

Here is some additional information that we could gather within Fractal:

  1. For each Fractal task, we can count the number of processed images. Preliminary questions:
    • Is the count homogeneous for non-parallel/parallel/compound tasks?
    • How many images are counted for a compound task? Examples: (1) MIP task going from N images to N different images, (2) time-slice or channel parallelization.
  2. Information about SLURM job IDs:
    • A list of SLURM job IDs corresponding to each Fractal task.
    • The SLURM job ID corresponding to each single component (that is, OME-Zarr image) of each Fractal task.
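A minimal sketch of what such per-task SLURM-ID bookkeeping could look like, down to the single component (OME-Zarr image). The structure and all names here are hypothetical, not an existing Fractal data model; the image count in point 1 then falls out of the per-component mapping:

```python
# Hypothetical bookkeeping linking one Fractal task to its SLURM jobs.
task_accounting = {
    "cellpose_segmentation": {
        "slurm_job_ids": [101, 102],   # all SLURM jobs for this task
        "per_component": {             # SLURM job ID per OME-Zarr image
            "plate.zarr/A/01/0": 101,
            "plate.zarr/A/02/0": 102,
        },
    },
}

def processed_image_count(task_record) -> int:
    """Processed images for a task = number of per-component entries."""
    return len(task_record["per_component"])
```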

SLURM-based proxies

If we have easy access (from within Fractal) to information like "give me all SLURM job IDs for user X in the time window from A to B", then we can prepare a tool which queries the SLURM DB (from outside Fractal) and processes all relevant information to produce a tabular output. In this scenario, we have access to a huge amount of SLURM detail, and we will expose the parts that matter.
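A minimal sketch of such a query. The `sacct` flags are real (`-u` user, `-S`/`-E` time window, `-X` allocations only, `-n` no header, `-P` pipe-delimited output); the helper names are ours, and the tabular post-processing is only hinted at:

```python
import subprocess

def sacct_job_ids(user: str, start: str, end: str) -> list[str]:
    """All SLURM job IDs for `user` between `start` and `end` (sketch)."""
    cmd = [
        "sacct", "-u", user, "-S", start, "-E", end,
        "-X", "-n", "-P", "--format=JobID",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_job_ids(result.stdout)

def parse_job_ids(output: str) -> list[str]:
    """Parse the pipe-delimited, header-less sacct output into job IDs."""
    return [line.strip() for line in output.strip().splitlines() if line.strip()]

# On a cluster one would call, e.g.:
# job_ids = sacct_job_ids("alice", "2024-12-01", "2024-12-10")
```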

In the future, one can also push for a deeper integration of this external tool into fractal, so that e.g. it is exposed from within the admin area (as if it were a task collection operation).

A note

To check: is SLURM db (the one accessed through sacct) maintained long-term? I guess so, but we should double check.

jluethi commented Dec 11, 2024

Fractal server-side monitoring:

  • Keeping a record of jobs run and the number of tasks executed (preferably counted per image, as in: running a workflow with 2 tasks on 10 images => 20 tasks were run)
  • Keep track of the number of images added to image lists: both the current number of images in datasets and a cumulative "total images ever created" counter
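A sketch of the two counters from the last bullet (hypothetical; fractal-server keeps no such counter today). The current count can shrink when images are removed, while the cumulative counter only ever grows:

```python
class ImageAccounting:
    """Hypothetical per-dataset image counters (illustrative names)."""

    def __init__(self):
        self.current = 0      # images currently in the dataset's image list
        self.total_ever = 0   # "total images ever created" counter

    def add_images(self, n: int) -> None:
        self.current += n
        self.total_ever += n

    def remove_images(self, n: int) -> None:
        self.current -= n     # total_ever is intentionally untouched

acc = ImageAccounting()
acc.add_images(10)
acc.remove_images(4)
# acc.current == 6, acc.total_ever == 10
```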

Slurm-based accounting:

  • Number of TBs processed: not sure how we'd get there. I/O measured? Bytes written?
  • Processing times measured via SLURM-based proxies => CPU hours, GPU hours, (memory hours?)
  • +1 on doing SLURM-based accounting outside of Fractal
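For the CPU-hours proxy, `sacct` already exposes the needed per-job fields (e.g. via `--format=ElapsedRaw,AllocCPUS`, and `CPUTimeRAW` equals elapsed seconds × allocated cores). A minimal sketch of the arithmetic, with our own helper name:

```python
def cpu_hours(elapsed_seconds: int, alloc_cpus: int) -> float:
    """CPU hours for one job: elapsed wall time times allocated cores."""
    return elapsed_seconds * alloc_cpus / 3600

# A one-hour job on 8 cores accounts for 8.0 CPU hours.
total = cpu_hours(3600, 8)
```

GPU hours would follow the same pattern, reading the GPU allocation from the job's TRES fields instead of `AllocCPUS`.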
