Kured not rebooting node with example var/run/reboot-required file #952

Open
chawleejay opened this issue Jul 9, 2024 · 12 comments

@chawleejay

Hello

I am trying to get Kured back up and running. The logs show:

time="2024-07-09T04:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T05:31:18Z" level=info msg="Reboot not required"
time="2024-07-09T06:31:18Z" level=info msg="Reboot not required"

but there is a reboot-required file on the node.

I'm not sure why this is happening. I'm using Kured v5.4.0.

Thanks

@ckotzbauer
Member

Hi @chawleejay,
can you please post your current Kured configuration and your installation method here? Otherwise we can't figure out what's happening, thanks.

@chawleejay
Author

chawleejay commented Jul 15, 2024

Kured is installed and the pods are up and running. The pod logs show "Reboot not required".

The reboot-required file was created on the node with: touch /var/run/reboot-required

  template:
    metadata:
      name: 'kured-{{name}}'
    spec:
      project: kured
      source:
        chart: kured
        helm:
          valueFiles:
            - values.yaml
          releaseName: kured
          values: |
            tolerations:
              - key: node-role.kubernetes.io/master
                effect: NoSchedule
              - key: workload-type
                value: confluent
                effect: NoSchedule
            updateStrategy: RollingUpdate
            maxUnavailable: 1
            configuration:
              period: 5h0m0s
              rebootDays: {{rebootDays}}
              lockTtl: 30m
              timeZone: America/Phoenix
              notifyUrl: {{notifyUrl}}
        repoURL: 'https://kubereboot.github.io/charts'
        targetRevision: 5.4.0
      destination:
        server: '{{address}}'
        namespace: '{{namespace}}'

@ckotzbauer
Member

Okay, I'm still not sure how kured is configured in your installation; the YAML does not make that clear. Can you please post the output of kubectl get daemonset -n <namespace> kured -o yaml here?

@chawleejay
Author

chawleejay commented Jul 16, 2024

creationTimestamp: "2022-09-07T17:14:52Z"
  generation: 16
  labels:
    app.kubernetes.io/instance: kured-devops
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kured
    helm.sh/chart: kured-5.4.0
    k8slens-edit-resource-version: v1
  name: kured
  namespace: kube-system
  resourceVersion: "3775521503"
  uid: b07427ea-5345-4bd0-bbaa-be3d4da149eb
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: kured
      app.kubernetes.io/name: kured
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: kured
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kured
        helm.sh/chart: kured-5.4.0
    spec:
      containers:
      - args:
        - --ds-name=kured
        - --ds-namespace=kube-system
        - --metrics-port=8080
        - --lock-ttl=30m
        - --period=0h0m30s
        - --force-reboot=true
        - --reboot-command=/bin/systemctl reboot
        - --notify-url=slack://KuredDevOps@ourtoken
        - --time-zone=America/Phoenix
        - --log-format=text
        - --concurrency=1
        command:
        - /usr/bin/kured
        env:
        - name: KURED_NODE_ID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: ghcr.io/kubereboot/kured:1.15.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        name: kured
        ports:
        - containerPort: 8080
          hostPort: 8080
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: metrics
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      hostPID: true
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kured
      serviceAccountName: kured
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: workload-type
        value: confluent
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberAvailable: 3
  numberMisscheduled: 0
  numberReady: 3
  observedGeneration: 16
  updatedNumberScheduled: 3

I just added --force-reboot=true today and still nothing happens. Thank you.

@jackfrancis
Collaborator

@chawleejay do you see this in the logs:

"sentinel command ended with unexpected exit code"...

If not, then based on your config it seems that test -f /var/run/reboot-required exited with code 1, indicating that the file doesn't exist from kured's point of view.
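
The exit-code reasoning above can be sketched as a small shell helper. This is only an illustration of the mapping described in this comment (0 means the file exists, 1 means it does not, anything else is unexpected); check_sentinel is a hypothetical name and is not part of kured.

```shell
#!/bin/sh
# Sketch: run a sentinel command and map its exit code to a decision.
# check_sentinel is a hypothetical helper, not a kured function.
check_sentinel() {
  # Run the command given as arguments, discarding its output.
  "$@" >/dev/null 2>&1
  rc=$?
  case "$rc" in
    0) echo "reboot required" ;;
    1) echo "reboot not required" ;;
    *) echo "unexpected exit code $rc" ;;
  esac
}

# Example usage on a node:
#   check_sentinel test -f /var/run/reboot-required
```

Running the helper on the node and again inside the kured pod shows whether both see the same file.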

@ryayon

ryayon commented Jul 30, 2024

Hello,

I have the same issue on Ubuntu nodes.

If I check the existence of the file directly on the node, I get:

$ test -f /var/run/reboot-required
$ echo $?
0

While, if I run the same command from the pod of the same node, I get:

# test -f /var/run/reboot-required
# echo $?
1

In addition, here is the content of /var/run in the pod:

# ls /var/run/
secrets

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@evrardjp
Collaborator

evrardjp commented Oct 18, 2024

Our CI works in the following way:

  • We have a volumeMount at /sentinel that mounts the host files
  • We pass the flag --reboot-sentinel=/sentinel/reboot-required

However, this should work by default: if you don't pass a sentinel command, kured should check for /var/run/reboot-required by using nsenter to enter the mount namespace of PID 1.

Did you try running the command /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required and checking its result?
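
To compare the two views, one can run the check twice from the kured pod (once directly, once through nsenter) and interpret the pair of exit codes. The sketch below assumes the pod is privileged with hostPID enabled, as in the DaemonSet posted earlier; compare_views is a hypothetical helper and <kured-pod> is a placeholder.

```shell
#!/bin/sh
# Sketch: interpret the exit codes of the in-container and nsenter checks.
# compare_views is a hypothetical helper; <kured-pod> is a placeholder.
#
# Obtain the two exit codes, for example:
#   kubectl exec -n kube-system <kured-pod> -- \
#     test -f /var/run/reboot-required; pod_rc=$?
#   kubectl exec -n kube-system <kured-pod> -- \
#     /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required; host_rc=$?
compare_views() {
  pod_rc=$1   # exit code of the check in the container's own mount namespace
  host_rc=$2  # exit code of the check in the host's mount namespace
  if [ "$host_rc" -eq 0 ] && [ "$pod_rc" -eq 0 ]; then
    echo "visible in both"
  elif [ "$host_rc" -eq 0 ]; then
    echo "host only"
  elif [ "$pod_rc" -eq 0 ]; then
    echo "container only"
  else
    echo "not found"
  fi
}

# Example usage:
#   compare_views "$pod_rc" "$host_rc"
```

A result of "host only" means the file exists on the node but is not mounted into the container, so a plain test -f inside the pod is expected to fail while the nsenter variant succeeds.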

@urbaman

urbaman commented Oct 20, 2024

Hi

I have the same problem in a MicroK8s deployment on Ubuntu 24.04 (node mk8s1, kured 1.16.0 deployed with manifests). The file was clearly present at the time of the logs:

# ls -la /var/run/reboot*
-rw-r--r-- 1 root root 32 Oct 17 06:36 /var/run/reboot-required
-rw-r--r-- 1 root root 40 Oct 17 06:36 /var/run/reboot-required.pkgs
# kubectl get pods -n kube-system -o wide
NAME                                       READY   STATUS    RESTARTS      AGE   IP             NODE    NOMINATED NODE   READINESS GATES
kured-6zxvf                                1/1     Running   6 (9d ago)    12d   10.1.217.210   mk8s3   <none>           <none>
kured-gss75                                1/1     Running   3 (9d ago)    12d   10.1.238.130   mk8s1   <none>           <none>
kured-z4wg8                                1/1     Running   2 (12d ago)   12d   10.1.115.130   mk8s2   <none>           <none>
# kubectl logs -n kube-system kured-gss75
time="2024-10-10T23:25:22Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2024-10-10T23:25:22Z" level=info msg="Kubernetes Reboot Daemon: 1.16.0"
time="2024-10-10T23:25:22Z" level=info msg="Node ID: mk8s1"
time="2024-10-10T23:25:22Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2024-10-10T23:25:22Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2024-10-10T23:25:22Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2024-10-10T23:25:22Z" level=info msg="PreferNoSchedule taint: "
time="2024-10-10T23:25:22Z" level=info msg="Blocking Pod Selectors: []"
time="2024-10-10T23:25:22Z" level=info msg="Reboot schedule: ---MonTueWedThuFri--- between 10:00 and 17:00 Europe/Rome"
time="2024-10-10T23:25:22Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s"
time="2024-10-10T23:25:22Z" level=info msg="Concurrency: 1"
time="2024-10-10T23:25:22Z" level=info msg="Reboot method: command"
time="2024-10-10T23:25:22Z" level=info msg="Reboot signal: 39"
time="2024-10-11T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-11T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-14T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-15T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-16T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-17T14:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T08:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T09:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T10:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T11:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T12:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T13:12:16Z" level=info msg="Reboot not required"
time="2024-10-18T14:12:16Z" level=info msg="Reboot not required"
# kubectl exec -ti -n kube-system kured-gss75 -- /bin/sh
/ # test -f /var/run/reboot-required
/ # echo $?
1
/ # /usr/bin/nsenter -m/proc/1/ns/mnt -- test -f /var/run/reboot-required
/ # echo $?
0
# test -f /var/run/reboot-required
# echo $?
0

@urbaman

urbaman commented Dec 20, 2024

Hi, has anyone had the same problem and resolved it?

@evrardjp
Collaborator

Hello, the issue is still unclear to me.
Did you try exposing the sentinel file through a volume mount instead of relying on nsenter? That is easier to deal with, in case the PID to target is not PID 1.
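
Exposing the sentinel as suggested could look roughly like the sketch below, which writes a DaemonSet fragment combining a hostPath mount of /var/run at /sentinel with the --reboot-sentinel flag mentioned earlier in this thread. The helper name write_sentinel_fragment, the output file name, and the volume name "sentinel" are illustrative assumptions. It also assumes a kured version that checks the sentinel path from inside the container; that has not been verified here against 1.15.0 or 1.16.0.

```shell
#!/bin/sh
# Sketch: write a DaemonSet fragment that mounts the host's /var/run into the
# kured container, so the sentinel can be checked as a regular file.
# write_sentinel_fragment and the default file name are hypothetical.
write_sentinel_fragment() {
  cat > "${1:-kured-sentinel-fragment.yaml}" <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: kured
          args:
            # Keep your existing args and add this one:
            - --reboot-sentinel=/sentinel/reboot-required
          volumeMounts:
            - name: sentinel
              mountPath: /sentinel
              readOnly: true
      volumes:
        - name: sentinel
          hostPath:
            path: /var/run
            type: Directory
EOF
}

# Example usage:
#   write_sentinel_fragment kured-sentinel-fragment.yaml
```

The fragment is meant to be merged by hand into the existing DaemonSet manifest (or the equivalent Helm values), not applied as-is, because the args list shown here is incomplete.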
