From ac9bb89749785e7c01dddad37f08d425f4d05b38 Mon Sep 17 00:00:00 2001
From: Dean Roehrich <dean.roehrich@hpe.com>
Date: Wed, 31 Jul 2024 15:53:25 -0500
Subject: [PATCH 1/4] Relationship between cray.nnf.node.drain taint and Storage
 resource.

Signed-off-by: Dean Roehrich <dean.roehrich@hpe.com>
---
 docs/guides/node-management/drain.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/guides/node-management/drain.md b/docs/guides/node-management/drain.md
index 9256415..8f996fd 100644
--- a/docs/guides/node-management/drain.md
+++ b/docs/guides/node-management/drain.md
@@ -19,12 +19,30 @@ as it cleans up the NNF pods.
 kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute
 ```
 
+This will cause the node's `Storage` resource to be disabled:
+
+```console
+$ kubectl get storages
+NAME           STATE     STATUS     MODE   AGE
+rabbit1        Enabled   Disabled   Live   3m18s
+rabbit2        Enabled   Ready      Live   3m18s
+```
+
+The `Storage` resource will contain the following message indicating the reason it has been disabled:
+
+```console
+$ kubectl get storages rabbit1 -o json | jq -rM .status.message
+Kubernetes node is tainted with cray.nnf.node.drain
+```
+
 To restore the node to service, remove the `cray.nnf.node.drain` taint.
 
 ```shell
 kubectl taint node $NODE cray.nnf.node.drain-
 ```
 
+The `Storage` resource will revert to a `Ready` status.
+
 ## The CSI Driver
 
 While the CSI driver pods may be drained from a Rabbit node, it is advisable not to do so.

From a781677ccecc2057f47941a601043d27c8545f0f Mon Sep 17 00:00:00 2001
From: Dean Roehrich <dean.roehrich@hpe.com>
Date: Thu, 1 Aug 2024 10:28:17 -0500
Subject: [PATCH 2/4] Relationship between cray.nnf.node.drain taint and Storage
 resource.

The Storage status will be "Drained".

Document how to use the Storage's .spec.state to manually disable a node.

Signed-off-by: Dean Roehrich <dean.roehrich@hpe.com>
---
 docs/guides/index.md                 |  2 +-
 docs/guides/node-management/drain.md | 54 +++++++++++++++++++++++-----
 mkdocs.yml                           |  2 +-
 3 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/docs/guides/index.md b/docs/guides/index.md
index 96dd22d..768d483 100644
--- a/docs/guides/index.md
+++ b/docs/guides/index.md
@@ -24,5 +24,5 @@
 
 ## Node Management
 
-* [Draining A Node](node-management/drain.md)
+* [Disable or Drain a Node](node-management/drain.md)
 * [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
diff --git a/docs/guides/node-management/drain.md b/docs/guides/node-management/drain.md
index 8f996fd..9ea3381 100644
--- a/docs/guides/node-management/drain.md
+++ b/docs/guides/node-management/drain.md
@@ -1,4 +1,40 @@
-# Draining A Node
+# Disable or Drain a Node
+
+## Disabling a node
+
+A Rabbit node can be manually disabled, indicating to the WLM that it should not schedule more jobs on the node. Jobs currently on the node will be allowed to complete at the discretion of the WLM.
+
+Disable a node by setting its Storage state to `Disabled`.
+
+```shell
+kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Disabled"}]'
+```
+
+When the Storage is queried by the WLM, it will show the disabled status.
+
+```console
+$ kubectl get storages
+NAME           STATE      STATUS     MODE   AGE
+kind-worker2   Enabled    Ready      Live   10m
+kind-worker3   Disabled   Disabled   Live   10m
+```
+
+To re-enable a node, set its Storage state to `Enabled`.
+
+```shell
+kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Enabled"}]'
+```
+
+The Storage state will show that it is enabled.
+
+```console
+$ kubectl get storages
+NAME           STATE     STATUS   MODE   AGE
+kind-worker2   Enabled   Ready    Live   10m
+kind-worker3   Enabled   Ready    Live   10m
+```
+
+## Draining a node
 
 The NNF software consists of a collection of DaemonSets and Deployments. The pods
 on the Rabbit nodes are usually from DaemonSets. Because of this, the `kubectl drain`
@@ -9,7 +45,7 @@ Given the limitations of DaemonSets, the NNF software will be drained by using t
 as described in
 [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
 
-## Drain NNF Pods From A Rabbit Node
+### Drain NNF pods from a rabbit node
 
 Drain the NNF software from a node by applying the `cray.nnf.node.drain` taint.
 The CSI driver pods will remain on the node to satisfy any unmount requests from k8s
@@ -19,16 +55,16 @@ as it cleans up the NNF pods.
 kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute
 ```
 
-This will cause the node's `Storage` resource to be disabled:
+This will cause the node's `Storage` resource to be drained:
 
 ```console
 $ kubectl get storages
-NAME           STATE     STATUS     MODE   AGE
-rabbit1        Enabled   Disabled   Live   3m18s
-rabbit2        Enabled   Ready      Live   3m18s
+NAME           STATE     STATUS    MODE   AGE
+kind-worker2   Enabled   Drained   Live   5m44s
+kind-worker3   Enabled   Ready     Live   5m45s
 ```
 
-The `Storage` resource will contain the following message indicating the reason it has been disabled:
+The `Storage` resource will contain the following message indicating the reason it has been drained:
 
 ```console
 $ kubectl get storages rabbit1 -o json | jq -rM .status.message
@@ -43,9 +79,9 @@ kubectl taint node $NODE cray.nnf.node.drain-
 
 The `Storage` resource will revert to a `Ready` status.
 
-## The CSI Driver
+### The CSI driver
 
-While the CSI driver pods may be drained from a Rabbit node, it is advisable not to do so.
+While the CSI driver pods may be drained from a Rabbit node, it is inadvisable to do so.
 
 **Warning** K8s relies on the CSI driver to unmount any filesystems that may have
 been mounted into a pod's namespace. If it is not present when k8s is attempting
diff --git a/mkdocs.yml b/mkdocs.yml
index 258fec7..6e0535c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -20,7 +20,7 @@ nav:
       - 'User Containers': 'guides/user-containers/readme.md'
       - 'Lustre External MGT': 'guides/external-mgs/readme.md'
       - 'Global Lustre': 'guides/global-lustre/readme.md'
-      - 'Draining A Node': 'guides/node-management/drain.md'
+      - 'Disable or Drain a Node': 'guides/node-management/drain.md'
       - 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md'
       - 'Directive Breakdown': 'guides/directive-breakdown/readme.md'
   - 'RFCs':

From a2d224e206f0862f3ebdf00062f18800e2115f7b Mon Sep 17 00:00:00 2001
From: Dean Roehrich <dean.roehrich@hpe.com>
Date: Thu, 1 Aug 2024 13:30:05 -0500
Subject: [PATCH 3/4] Explain when to use the taint

Signed-off-by: Dean Roehrich <dean.roehrich@hpe.com>
---
 docs/guides/node-management/drain.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/guides/node-management/drain.md b/docs/guides/node-management/drain.md
index 9ea3381..5eeea13 100644
--- a/docs/guides/node-management/drain.md
+++ b/docs/guides/node-management/drain.md
@@ -45,6 +45,10 @@ Given the limitations of DaemonSets, the NNF software will be drained by using t
 as described in
 [Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
 
+This would be used only after the WLM jobs have been removed from that Rabbit (preferably) and there is some reason to also remove the NNF software from it. This might be used before a Rabbit is powered off and pulled out of the cabinet, for example, to avoid leaving pods in "Terminating" state (harmless, but it's noise).
+
+If an admin used this taint before power-off it would mean there wouldn't be "Terminating" pods laying around for that Rabbit. After a new/same Rabbit is put back in its place, the NNF software won't jump back on it while the taint is present. The taint can be removed at any time, from immediately after the node is powered off up to some time after the new/same Rabbit is powered back on.
+
 ### Drain NNF pods from a rabbit node
 
 Drain the NNF software from a node by applying the `cray.nnf.node.drain` taint.

From a22464ccd6e25360c2a79b38766f879f115764a7 Mon Sep 17 00:00:00 2001
From: Dean Roehrich <dean.roehrich@hpe.com>
Date: Thu, 1 Aug 2024 15:06:50 -0500
Subject: [PATCH 4/4] Update docs/guides/node-management/drain.md

Co-authored-by: Blake Devcich <89158881+bdevcich@users.noreply.github.com>
---
 docs/guides/node-management/drain.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/node-management/drain.md b/docs/guides/node-management/drain.md
index 5eeea13..8c00a7c 100644
--- a/docs/guides/node-management/drain.md
+++ b/docs/guides/node-management/drain.md
@@ -47,7 +47,7 @@ as described in
 
 This would be used only after the WLM jobs have been removed from that Rabbit (preferably) and there is some reason to also remove the NNF software from it. This might be used before a Rabbit is powered off and pulled out of the cabinet, for example, to avoid leaving pods in "Terminating" state (harmless, but it's noise).
 
-If an admin used this taint before power-off it would mean there wouldn't be "Terminating" pods laying around for that Rabbit. After a new/same Rabbit is put back in its place, the NNF software won't jump back on it while the taint is present. The taint can be removed at any time, from immediately after the node is powered off up to some time after the new/same Rabbit is powered back on.
+If an admin applies this taint before power-off, no "Terminating" pods will be left lying around for that Rabbit. After a new or the same Rabbit is put back in its place, the NNF software will not be scheduled onto it while the taint is present. The taint can be removed at any time, from immediately after the node is powered off up to some time after the new or the same Rabbit is powered back on.
 
 ### Drain NNF pods from a rabbit node