Relationship between the cray.nnf.node.drain taint and the Storage resource. #188

Merged
merged 4 commits into from
Aug 1, 2024
2 changes: 1 addition & 1 deletion docs/guides/index.md
@@ -24,5 +24,5 @@

## Node Management

* [Draining A Node](node-management/drain.md)
* [Disable or Drain a Node](node-management/drain.md)
* [Debugging NVMe Namespaces](node-management/nvme-namespaces.md)
66 changes: 62 additions & 4 deletions docs/guides/node-management/drain.md
@@ -1,4 +1,40 @@
# Draining A Node
# Disable Or Drain A Node

## Disabling a node

A Rabbit node can be manually disabled, indicating to the WLM that it should not schedule more jobs on the node. Jobs currently on the node will be allowed to complete at the discretion of the WLM.

Disable a node by setting its Storage state to `Disabled`.

```shell
kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Disabled"}]'
```

When the Storage is queried by the WLM, it will show the disabled status.

```console
$ kubectl get storages
NAME           STATE      STATUS     MODE   AGE
kind-worker2   Enabled    Ready      Live   10m
kind-worker3   Disabled   Disabled   Live   10m
```

To re-enable a node, set its Storage state to `Enabled`.

```shell
kubectl patch storage $NODE --type=json -p '[{"op":"replace", "path":"/spec/state", "value": "Enabled"}]'
```

The Storage state will show that it is enabled.

```console
$ kubectl get storages
NAME           STATE     STATUS   MODE   AGE
kind-worker2   Enabled   Ready    Live   10m
kind-worker3   Enabled   Ready    Live   10m
```
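
The two patch commands above differ only in the state value, so they can be wrapped in a small helper. The following is a minimal sketch, not part of the NNF software: the function names `storage_state_patch` and `set_storage_state` are hypothetical, and it assumes `Enabled` and `Disabled` are the only states an administrator would set, since those are the only two used in this guide.

```shell
# Build the JSON patch used in this guide for a given Storage state.
# storage_state_patch is a hypothetical helper name, not part of NNF.
storage_state_patch() {
    case "$1" in
        Enabled|Disabled)
            printf '[{"op":"replace", "path":"/spec/state", "value": "%s"}]\n' "$1"
            ;;
        *)
            # Assumption: only the two states shown in this guide are accepted.
            echo "state must be Enabled or Disabled, got: $1" >&2
            return 1
            ;;
    esac
}

# Patch the Storage resource for a node (hypothetical wrapper).
# Arguments: node name, desired state.
set_storage_state() {
    node=$1
    state=$2
    patch=$(storage_state_patch "$state") || return 1
    kubectl patch storage "$node" --type=json -p "$patch"
}
```

With these functions sourced into the shell, `set_storage_state kind-worker3 Disabled` would issue the same patch as the manual command, and a mistyped state is rejected before `kubectl` is called.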

## Draining a node

The NNF software consists of a collection of DaemonSets and Deployments. The pods
on the Rabbit nodes are usually from DaemonSets. Because of this, the `kubectl drain`
@@ -9,7 +45,11 @@ Given the limitations of DaemonSets, the NNF software will be drained by using t
as described in
[Taints and Tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

## Drain NNF Pods From A Rabbit Node
Use this taint only when there is a reason to remove the NNF software from a Rabbit, preferably after the WLM jobs have been removed from it. For example, apply it before a Rabbit is powered off and pulled out of the cabinet to avoid leaving pods in the "Terminating" state, which is harmless but noisy.

While the taint is present, the NNF software will not be scheduled back onto the node, even after the same or a replacement Rabbit is installed in its place. The taint can be removed at any time, from immediately after the node is powered off until some time after the Rabbit is powered back on.

### Drain NNF pods from a Rabbit node

Drain the NNF software from a node by applying the `cray.nnf.node.drain` taint.
The CSI driver pods will remain on the node to satisfy any unmount requests from k8s
as it cleans up the NNF pods.

```shell
kubectl taint node $NODE cray.nnf.node.drain=true:NoSchedule cray.nnf.node.drain=true:NoExecute
```
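
To confirm that both effects of the taint were applied, the node's taints can be read back with a `kubectl` JSONPath query. This is a sketch under assumptions: the helper names `node_taints` and `has_drain_taint` are hypothetical and are not provided by the NNF software.

```shell
# List a node's taints, one per line, in key=value:effect form.
# node_taints is a hypothetical helper name.
node_taints() {
    kubectl get node "$1" \
        -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# Succeed if the node carries the cray.nnf.node.drain taint with the given effect.
# Arguments: node name, effect (for example NoSchedule or NoExecute).
has_drain_taint() {
    node_taints "$1" | grep -q "^cray\.nnf\.node\.drain=.*:$2\$"
}
```

For example, `has_drain_taint "$NODE" NoSchedule && has_drain_taint "$NODE" NoExecute && echo "drain taint applied"` prints the message only when both effects are present.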

This will cause the node's `Storage` resource to be drained:

```console
$ kubectl get storages
NAME           STATE     STATUS    MODE   AGE
kind-worker2   Enabled   Drained   Live   5m44s
kind-worker3   Enabled   Ready     Live   5m45s
```

The `Storage` resource will contain the following message indicating the reason it has been drained:

```console
$ kubectl get storages kind-worker2 -o json | jq -rM .status.message
Kubernetes node is tainted with cray.nnf.node.drain
```
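
On a system with many Rabbits, the two queries above can be combined to report every drained `Storage` resource together with its message. This is a rough sketch: the function names `storages_with_status` and `report_drained` are hypothetical, and it assumes the `NAME STATE STATUS MODE AGE` column order shown in the listing above.

```shell
# Print the names of Storage resources whose STATUS column matches $1.
# Reads `kubectl get storages` table output on stdin and skips the header row.
# Assumption: STATUS is the third column, as in the listing above.
storages_with_status() {
    awk -v want="$1" 'NR > 1 && $3 == want { print $1 }'
}

# Print "name: message" for every drained Storage resource (hypothetical helper).
report_drained() {
    kubectl get storages | storages_with_status Drained | while read -r name; do
        # Redirect stdin so kubectl cannot consume the list being iterated.
        msg=$(kubectl get storages "$name" -o json </dev/null | jq -rM .status.message)
        printf '%s: %s\n' "$name" "$msg"
    done
}
```

Parsing the table keeps the sketch independent of the resource's internal field names, at the cost of depending on the column order.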

To restore the node to service, remove the `cray.nnf.node.drain` taint.

```shell
kubectl taint node $NODE cray.nnf.node.drain-
```

## The CSI Driver
The `Storage` resource will revert to a `Ready` status.
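
The return to `Ready` may not be immediate, so a script that removes the taint may want to poll for it. The following is a minimal sketch, assuming that `STATUS` is the third column of the `kubectl get storages` output; `wait_for_storage_status` and the `POLL_INTERVAL` variable are hypothetical names, not part of the NNF software.

```shell
# Poll until the named Storage resource reports the wanted STATUS, or give up.
# Arguments: Storage name, wanted status, number of attempts (default 30).
# POLL_INTERVAL (hypothetical variable) sets the seconds between attempts.
wait_for_storage_status() {
    name=$1
    want=$2
    tries=${3:-30}
    status=""
    while [ "$tries" -gt 0 ]; do
        # --no-headers leaves a single row; STATUS is assumed to be column 3.
        status=$(kubectl get storages "$name" --no-headers | awk '{ print $3 }')
        if [ "$status" = "$want" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-2}"
    done
    echo "timed out waiting for $name to become $want (last: $status)" >&2
    return 1
}
```

A typical use would be `kubectl taint node "$NODE" cray.nnf.node.drain- && wait_for_storage_status "$NODE" Ready`.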

### The CSI driver

While the CSI driver pods may be drained from a Rabbit node, it is advisable not to do so.
While the CSI driver pods may be drained from a Rabbit node, it is inadvisable to do so.

**Warning** K8s relies on the CSI driver to unmount any filesystems that may have
been mounted into a pod's namespace. If it is not present when k8s is attempting
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -20,7 +20,7 @@ nav:
- 'User Containers': 'guides/user-containers/readme.md'
- 'Lustre External MGT': 'guides/external-mgs/readme.md'
- 'Global Lustre': 'guides/global-lustre/readme.md'
- 'Draining A Node': 'guides/node-management/drain.md'
- 'Disable or Drain a Node': 'guides/node-management/drain.md'
- 'Debugging NVMe Namespaces': 'guides/node-management/nvme-namespaces.md'
- 'Directive Breakdown': 'guides/directive-breakdown/readme.md'
- 'RFCs':