
Document job directives and environment variables #185

Merged: 2 commits into `main` from `directives` on Jul 29, 2024

Conversation

matthew-richerson (Contributor)

No description provided.


A user may include one or more Data Workflow directives in their job script to request Rabbit services. Directives take the form `#DW [command] [command args]`, and are passed from the workload manager to the Rabbit software for processing. The directives can be used to allocate Rabbit file systems, copy files, and run user containers on the Rabbit nodes.

Once the job is running on the compute nodes, the application can access Rabbit-specific resources through a set of environment variables that provide mount and network access information.
Contributor:

Perhaps add something like: these variables are provided to the workload manager.

Contributor:

Nevermind. I see there's a section at the end for this.
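As a concrete illustration of the overview text above, a minimal job script and compute-side usage might look like the following sketch. The `jobdw` arguments come from the tables later in this document; the mount-point variable is assumed to follow the same `$DW_JOB_[name]` convention used by `copy_in` and `copy_out`, with `-` in the allocation name replaced by `_`. The file names and sizes are illustrative.

```
#!/bin/bash
# Request a per-compute XFS file system from the Rabbit (illustrative values).
#DW jobdw type=xfs capacity=100GB name=scratch-space

# On the compute nodes, the mount point for the allocation named
# "scratch-space" is assumed to be exposed as $DW_JOB_scratch_space
# (the '-' in the directive name becomes '_' in the variable name).
cp input.dat "$DW_JOB_scratch_space/"
```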


## Overview

A user may include one or more Data Workflow directives in their job script to request Rabbit services. Directives take the form `#DW [command] [command args]`, and are passed from the workload manager to the Rabbit software for processing. The directives can be used to allocate Rabbit file systems, copy files, and run user containers on the Rabbit nodes.
Contributor:

You use [command] here but then your next section is Directives. I think you should use the same terms here.

| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. |
| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` |
| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job |
| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. |
Contributor:

Link to storage profile docs?

```
#DW jobdw type=gfs2 capacity=50GB name=checkpoint requires=copy-offload
```

This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running.
Contributor:
Suggested change:

This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running using the Copy Offload API.

| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `source` | Yes | `[path]`, `$DW_JOB_[name]/[path]`, `$DW_PERSISTENT_[name]/[path]` | `[name]` is the name of the Rabbit persistent or job storage as specified in the `name` argument of the `jobdw` or `persistentdw` directive. Any `'-'` in the name from the `jobdw` or `persistentdw` directive should be changed to a `'_'` in the `copy_in` and `copy_out` directive. |
| `destination` | Yes | `[path]`, `$DW_JOB_[name]/[path]`, `$DW_PERSISTENT_[name]/[path]` | `[name]` is the name of the Rabbit persistent or job storage as specified in the `name` argument of the `jobdw` or `persistentdw` directive. Any `'-'` in the name from the `jobdw` or `persistentdw` directive should be changed to a `'_'` in the `copy_in` and `copy_out` directive. |
Contributor:

Looks like the source/destination notes are copy and pasted from the persistent directives
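For comparison, a hedged sketch of a `copy_in` paired with a `jobdw` allocation, built only from the arguments in the table above; the source path and allocation name are illustrative. Note the `-` in `training-data` becomes `_` in `$DW_JOB_training_data`, per the notes in the table.

```
# Stage data into a job allocation before the application starts (illustrative paths).
#DW jobdw type=lustre capacity=1TiB name=training-data
#DW copy_in source=/lus/global/dataset destination=$DW_JOB_training_data/
```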

| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. |
| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` |
| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job |
Contributor:

Every place in this doc where you say, "String including numbers..." you should say that it must be lowercase letters.

These ultimately refer to k8s resources, which require lowercase.
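If that constraint is adopted, a short example in the notes could make it concrete; the names below are illustrative and assume the lowercase rule described in this comment:

```
# OK: lowercase letters, numbers, and '-'
#DW jobdw type=gfs2 capacity=50GB name=checkpoint-data1

# Rejected under the lowercase rule: contains uppercase letters
#DW jobdw type=gfs2 capacity=50GB name=CheckpointData
```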


### create_persistent

The `create_persistent` command results in a storage allocation on the Rabbit nodes that lasts beyond the lifetime of the job. This is useful for creating a file system that can share data between jobs. Only a single `create_persistent` directive is allowed in a job, and it cannot be in the same job as a `destroy_persistent` directive.
Contributor:

Add: See persistent_dw to utilize the storage in a job.

| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` |
| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job |
| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. |
| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. This is for users that want to initiate data movement to or from the Rabbit storage from within their application |
Contributor:

Maybe some reference to:

See `RequiredDaemons` in [Directive Breakdown](../directive-breakdown/readme.md) for a description of how the user may request the daemon, in the case where the WLM will run it only on demand.

In case an admin is reading your doc, because the user pointed at it in their complaint, and needs a prod to know where else to look.
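To connect `create_persistent` with the earlier suggestion about `persistentdw`, a hedged sketch of the two-job flow might look like this. The allocation name matches the `shared-data1` used in the example below; the `type` argument is assumed to be accepted by `create_persistent` as it is by `jobdw`, and `persistentdw` is assumed to take the `name` of the existing persistent allocation.

```
# Job 1: create a persistent Lustre file system that outlives the job (illustrative size).
#DW create_persistent type=lustre capacity=10TiB name=shared-data1

# Job 2 (submitted later): attach the existing persistent allocation by name.
#DW persistentdw name=shared-data1
```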

```
#DW copy_out source=$DW_PERSISTENT_shared_data1/b destination=$DW_PERSISTENT_shared_data2/b profile=no-xattr
```

This set of directives copies two directories from one persistent storage allocation to another persistent storage allocation using the `no-xattr` profile to avoid copying xattrs. This data movement occurs after the job application exits on the compute nodes, and the two copies do not occur in a guaranteed order.
Contributor:
guaranteed => deterministic ?

Signed-off-by: Matt Richerson <[email protected]>
matthew-richerson merged commit daed78b into main on Jul 29, 2024
1 check passed
matthew-richerson deleted the directives branch on July 29, 2024 at 19:38