Document job directives and environment variables #185
Conversation
Signed-off-by: Matt Richerson <[email protected]>
Once the job is running on the compute nodes, the application can access Rabbit-specific resources through a set of environment variables that provide mount and network access information.
Perhaps add something like: these variables are provided to the workload manager.
Nevermind. I see there's a section at the end for this.
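For illustration, a minimal sketch of consuming one of these variables from a job step, assuming a `jobdw` directive with `name=checkpoint` (the file names are hypothetical; the `$DW_JOB_[name]` form comes from the copy directive tables below):

```
# The workload manager exports DW_JOB_[name] into the job environment;
# per the copy directive tables, it refers to the Rabbit file system path.
echo "Rabbit file system for this job: $DW_JOB_checkpoint"
cp ./restart.dat "$DW_JOB_checkpoint/restart.dat"
```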
## Overview
A user may include one or more Data Workflow directives in their job script to request Rabbit services. Directives take the form `#DW [command] [command args]`, and are passed from the workload manager to the Rabbit software for processing. The directives can be used to allocate Rabbit file systems, copy files, and run user containers on the Rabbit nodes.
You use `[command]` here but then your next section is Directives. I think you should use the same terms here.
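To make the `#DW [command] [command args]` form concrete, here is a minimal job-script sketch (the scheduler invocation and application are hypothetical; the directive arguments are described in the table below):

```
#!/bin/bash
# The workload manager extracts the #DW lines and passes them to the
# Rabbit software; the rest of the script runs as a normal job.
#DW jobdw type=xfs capacity=200GB name=scratch
srun ./my_app --workdir "$DW_JOB_scratch"
```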
| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. |
| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` |
| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job |
| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. |
Link to storage profile docs?
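As a sketch of how `type` changes the allocation shape (the names and sizes are illustrative):

```
# One Lustre file system shared by all computes in the job; capacity
# is the aggregate OST capacity.
#DW jobdw type=lustre capacity=10TiB name=shared-scratch

# A separate 100GB GFS2 file system allocated for each compute node.
#DW jobdw type=gfs2 capacity=100GB name=local-scratch
```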
```
#DW jobdw type=gfs2 capacity=50GB name=checkpoint requires=copy-offload
```
This directive results in a 50GB GFS2 file system created for each compute node in the job using the default storage profile. The copy-offload daemon is started on the compute node to allow the application to request the Rabbit to move data from the GFS2 file system to another file system while the application is running.
Suggested change: "…while the application is running." → "…while the application is running using the Copy Offload API."
| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `source` | Yes | `[path]`, `$DW_JOB_[name]/[path]`, `$DW_PERSISTENT_[name]/[path]` | `[name]` is the name of the Rabbit persistent or job storage as specified in the `name` argument of the `jobdw` or `persistentdw` directive. Any `'-'` in the name from the `jobdw` or `persistentdw` directive should be changed to a `'_'` in the `copy_in` and `copy_out` directive. |
| `destination` | Yes | `[path]`, `$DW_JOB_[name]/[path]`, `$DW_PERSISTENT_[name]/[path]` | `[name]` is the name of the Rabbit persistent or job storage as specified in the `name` argument of the `jobdw` or `persistentdw` directive. Any `'-'` in the name from the `jobdw` or `persistentdw` directive should be changed to a `'_'` in the `copy_in` and `copy_out` directive. |
Looks like the source/destination notes are copied and pasted from the persistent directives.
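A short sketch of the `'-'` to `'_'` substitution (the global file system path is hypothetical):

```
#DW jobdw type=lustre capacity=1TiB name=job-data
# "job-data" becomes "job_data" in the environment variable reference:
#DW copy_in source=/lus/global/input destination=$DW_JOB_job_data/input
```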
| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `type` | Yes | `raw`, `xfs`, `gfs2`, `lustre` | Type defines how the storage should be formatted. For Lustre file systems, a single file system is created that is mounted by all computes in the job. For raw, xfs, and GFS2 storage, a separate file system is allocated for each compute node. |
| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` |
| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job |
Every place in this doc where you say, "String including numbers..." you should say that it must be lowercase letters.
These ultimately refer to k8s resources, which require lowercase.
### create_persistent
The `create_persistent` command results in a storage allocation on the Rabbit nodes that lasts beyond the lifetime of the job. This is useful for creating a file system that can be used to share data between jobs. Only a single `create_persistent` directive is allowed in a job, and it cannot be in the same job as a `destroy_persistent` directive.
Add: See persistent_dw to utilize the storage in a job.
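A minimal sketch of that pairing (the name and sizes are illustrative): one job creates the persistent file system, and a later job attaches to it with a `persistentdw` directive:

```
# Job 1: create a persistent file system that outlives the job.
#DW create_persistent type=lustre capacity=100TiB name=shared-data

# Job 2, submitted later: attach the existing persistent allocation by name.
#DW persistentdw name=shared-data
```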
| Argument | Required | Value | Notes |
|----------|----------|-------|-------|
| `capacity` | Yes | Allocation size with units. `1TiB`, `100GB`, etc. | Capacity interpretation varies by storage type. For Lustre file systems, capacity is the aggregate OST capacity. For raw, xfs, and GFS2 storage, capacity is the capacity of the file system for a single compute node. Capacity suffixes are: `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB` |
| `name` | Yes | String including numbers and '-' | This is a name for the storage allocation that is unique within a job |
| `profile` | No | Profile name | This specifies which profile to use when allocating storage. Profiles include `mkfs` and `mount` arguments, file system layout, and many other options. Profiles are created by admins. When no profile is specified, the default profile is used. |
| `requires` | No | `copy-offload` | Using this option results in the copy offload daemon running on the compute nodes. This is for users who want to initiate data movement to or from the Rabbit storage from within their application. |
Maybe some reference to:
See `RequiredDaemons` in [Directive Breakdown](../directive-breakdown/readme.md) for a description of how the user may request the daemon, in the case where the WLM will run it only on demand.
In case an admin is reading your doc, because the user pointed at it in their complaint, and needs a prod to know where else to look.
```
#DW copy_out source=$DW_PERSISTENT_shared_data1/b destination=$DW_PERSISTENT_shared_data2/b profile=no-xattr
```
This set of directives copies two directories from one persistent storage allocation to another persistent storage allocation using the `no-xattr` profile to avoid copying xattrs. This data movement occurs after the job application exits on the compute nodes, and the two copies do not occur in a guaranteed order.
guaranteed => deterministic ?