
Troubleshooting for Microk8s Kubernetes cluster on NREC

Rui Wang edited this page Sep 19, 2024 · 14 revisions

Troubleshooting

After the MicroK8s Kubernetes cluster is created, if there are errors or issues accessing the cluster or an application, we can check each NREC instance that hosts a cluster node.

Microk8s Status

On each cluster node, we can run:

[rocky@hono-api-prod-01 ~]$ microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 158.37.65.7:19001 158.37.65.60:19001 158.37.65.111:19001
  datastore standby nodes: none
addons:
  enabled:
    cert-manager         # (core) Cloud native certificate management
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    ingress              # (core) Ingress controller for external access
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    rbac                 # (core) Role-Based Access Control for authorisation
  disabled:
    cis-hardening        # (core) Apply CIS K8s hardening
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    minio                # (core) MinIO object storage
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    registry             # (core) Private image registry exposed on localhost:32000
    rook-ceph            # (core) Distributed Ceph storage using Rook
    storage              # (core) Alias to hostpath-storage add-on, deprecated

and

> microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

Building the report tarball
  Report tarball is at /var/snap/microk8s/6750/inspection-report-20240523_161839.tar.gz
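The last line of `microk8s inspect` prints the path of the report tarball; the exact path differs per run and per node. As a sketch, we can extract it to browse the collected service logs and system information (the path below is from the run above and will not match your system):

```shell
# Replace the path with the one printed by your own `microk8s inspect` run
REPORT=/var/snap/microk8s/6750/inspection-report-20240523_161839.tar.gz
mkdir -p /tmp/microk8s-report
tar -xzf "$REPORT" -C /tmp/microk8s-report

# Browse the extracted service logs and system info
ls /tmp/microk8s-report
```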

Logs of app/pods

If a pod is not behaving as expected, the first port of call should be the logs.

[rocky@hono-api-prod-01 ~]$ kubectl get deploy -n hono-api-prod -o wide
NAME       READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS   IMAGES                                             SELECTOR
hono-api   3/3     3            3           7d18h   hono-api     ghcr.io/uib-ub/uib-ub/uib-ub-monorepo-api:latest   app=hono-api
[rocky@hono-api-prod-01 ~]$ kubectl get pod -n hono-api-prod -o wide
NAME                        READY   STATUS    RESTARTS      AGE     IP            NODE               NOMINATED NODE   READINESS GATES
hono-api-6fd895c8bc-5l2h9   1/1     Running   9             7d18h   10.1.56.98    hono-api-prod-03   <none>           <none>
hono-api-6fd895c8bc-92l8p   1/1     Running   10 (9h ago)   7d18h   10.1.27.56    hono-api-prod-02   <none>           <none>
hono-api-6fd895c8bc-rfh9r   1/1     Running   8 (9h ago)    7d18h   10.1.53.175   hono-api-prod-01   <none>           <none>

Then, check the logs of the app (across all pods matching the label):

> kubectl logs -f -n hono-api-prod -l app=hono-api
GET  /legacy/groups/:source
GET  /legacy/groups/:source/:id
GET  /admin/ingest
GET  /admin/ingest/manifests
GET  /admin/ingest/legacy/ska
GET  /admin/ingest/legacy/wab
GET  /ns/es/context.json
GET  /ns/ubbont/context.json
GET  /ns/shacl/context.json
GET  /openapi
GET  /legacy/groups/:source
GET  /legacy/groups/:source/:id
GET  /admin/ingest
GET  /admin/ingest/manifests
GET  /admin/ingest/legacy/ska
GET  /admin/ingest/legacy/wab
GET  /ns/es/context.json
GET  /ns/ubbont/context.json
GET  /ns/shacl/context.json
GET  /openapi
GET  /legacy/groups/:source
GET  /legacy/groups/:source/:id
GET  /admin/ingest
GET  /admin/ingest/manifests
GET  /admin/ingest/legacy/ska
GET  /admin/ingest/legacy/wab
GET  /ns/es/context.json
GET  /ns/ubbont/context.json
GET  /ns/shacl/context.json
GET  /openapi

Or, check the log of a pod:

kubectl logs -f -n hono-api-prod hono-api-6fd895c8bc-5l2h9 
GET  /
GET  /items
GET  /items/:id
GET  /items/:id/manifest
GET  /items/:id/manifest.json
GET  /reference
GET  /lookup/:id
PUT  /admin/es/update-templates
GET  /legacy/wab/list
GET  /legacy/wab
GET  /legacy/items/:source/count
GET  /legacy/items/:source
GET  /legacy/items/:source/:id
GET  /legacy/items/:source/:id/manifest.json
GET  /legacy/people/:source/count
GET  /legacy/people/:source
GET  /legacy/people/:source/:id
GET  /legacy/groups/:source/count
GET  /legacy/groups/:source
GET  /legacy/groups/:source/:id
GET  /admin/ingest
GET  /admin/ingest/manifests
GET  /admin/ingest/legacy/ska
GET  /admin/ingest/legacy/wab
GET  /ns/es/context.json
GET  /ns/ubbont/context.json
GET  /ns/shacl/context.json
GET  /openapi
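The RESTARTS column in the pod listing above shows the pods have restarted several times. To see why a container restarted, we can ask for the logs of the previous container instance; a sketch, using a pod name from the listing above:

```shell
# Logs from the previous (restarted) container of a pod
kubectl logs --previous -n hono-api-prod hono-api-6fd895c8bc-5l2h9

# Only the last 100 lines of the current container, with timestamps
kubectl logs --tail=100 --timestamps -n hono-api-prod hono-api-6fd895c8bc-5l2h9
```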

Rolling restart a deployment

Check deployment for all namespaces:

[rocky@hono-api-prod-01 ~]$ kubectl get deployment --all-namespaces
NAMESPACE                   NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
cattle-fleet-local-system   fleet-agent               1/1     1            1           77d
cattle-fleet-system         fleet-agent               1/1     1            1           75d
cattle-fleet-system         fleet-controller          1/1     1            1           90d
cattle-fleet-system         gitjob                    1/1     1            1           90d
cattle-system               cattle-cluster-agent      2/2     2            2           75d
cattle-system               rancher-webhook           1/1     1            1           75d
cert-manager                cert-manager              1/1     1            1           90d
cert-manager                cert-manager-cainjector   1/1     1            1           90d
cert-manager                cert-manager-webhook      1/1     1            1           90d
default                     github-deploy-hono        1/1     1            1           23d
hono-api-prod               hono-api                  3/3     3            3           7d18h
hono-api-test               hono-api                  3/3     3            3           7d17h
kube-system                 calico-kube-controllers   1/1     1            1           90d
kube-system                 coredns                   1/1     1            1           90d
kube-system                 metrics-server            1/1     1            1           90d
monitoring                  kube-state-metrics        3/3     3            3           89d

For example, to restart the deployment hono-api, run:

kubectl rollout restart deployment/hono-api -n hono-api-prod   
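After triggering the restart, we can watch its progress; `kubectl rollout status` blocks until the new pods are ready (or the rollout fails):

```shell
# Watch the rolling restart until it completes
kubectl rollout status deployment/hono-api -n hono-api-prod

# Review previous rollouts of this deployment
kubectl rollout history deployment/hono-api -n hono-api-prod
```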

Scale a deployment

Taking hono-api as an example, to scale the deployment down to 0 replicas (stopping the deployment):

kubectl scale deployment/hono-api --replicas=0 -n hono-api-prod 

or scale back up to 3 replicas:

kubectl scale deployment/hono-api --replicas=3 -n hono-api-prod

Display resource (CPU/memory) metrics

For example:

> kubectl top node
NAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
hono-api-prod-01   276m         13%    4714Mi          62%
hono-api-prod-02   252m         12%    4302Mi          56%
hono-api-prod-03   311m         15%    4709Mi          62%
> kubectl top pod -n hono-api-prod
NAME                        CPU(cores)   MEMORY(bytes)
hono-api-54b78fdc5d-4nj55   1m           182Mi
hono-api-54b78fdc5d-hl9nh   1m           181Mi
hono-api-54b78fdc5d-s7lgq   1m           174Mi
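When a pod runs more than one container, a per-container breakdown can be displayed with the `--containers` flag:

```shell
# Per-container CPU/memory usage within each pod of the namespace
kubectl top pod -n hono-api-prod --containers
```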

Debug communication between pods

Sometimes the problem is that pods (or cluster nodes) cannot communicate with each other. To test communication, we can create a debug pod on each NREC instance, e.g.:

(assume the cluster has 3 NREC instances with hostnames hono-api-prod-01, hono-api-prod-02, and hono-api-prod-03, respectively)

on NREC instance node one:

kubectl run -i --tty --rm debug-n1 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-01"}}}' -- sh

on NREC instance node two:

kubectl run -i --tty --rm debug-n2 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-02"}}}' -- sh

on NREC instance node three:

kubectl run -i --tty --rm debug-n3 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-03"}}}' -- sh

Then, we get the IPs of these debug pods:

> kubectl get pods -o wide
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
debug-n1                    1/1     Running   0          3m39s   10.1.167.31   hono-api-prod-01   <none>           <none>
debug-n2                    1/1     Running   0          98s     10.1.76.55    hono-api-prod-02   <none>           <none>
debug-n3                    1/1     Running   0          76s     10.1.42.16    hono-api-prod-03   <none>           <none>

Then, within each debug pod, we can ping the other pods' IPs to check node-to-node communication, and also ping an external site such as google.com to check outbound connectivity.

For example, from pod debug-n3, we ping 10.1.167.31 to check the communication to pod debug-n1:

> kubectl run -i --tty --rm debug-n3 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-03"}}}' -- sh
If you don't see a command prompt, try pressing enter.
/ #
/ #
/ #
/ # ping 10.1.167.31
PING 10.1.167.31 (10.1.167.31): 56 data bytes
64 bytes from 10.1.167.31: seq=0 ttl=62 time=0.545 ms
64 bytes from 10.1.167.31: seq=1 ttl=62 time=0.396 ms
64 bytes from 10.1.167.31: seq=2 ttl=62 time=0.400 ms
64 bytes from 10.1.167.31: seq=3 ttl=62 time=0.396 ms
64 bytes from 10.1.167.31: seq=4 ttl=62 time=0.351 ms
64 bytes from 10.1.167.31: seq=5 ttl=62 time=0.331 ms
^C
--- 10.1.167.31 ping statistics ---
6 packets transmitted, 6 packets received, 0% packet loss
round-trip min/avg/max = 0.331/0.403/0.545 ms
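The same check can also be run non-interactively as a one-shot command, which is convenient for scripting; a sketch, where the target IP is the debug-n1 pod IP from the listing above:

```shell
# One-shot connectivity test: ping the debug-n1 pod IP from node three,
# then clean up the temporary pod automatically (--rm)
kubectl run debug-ping --rm -i --restart=Never --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-03"}}}' \
  -- ping -c 3 10.1.167.31
```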

We can also ping google.com to verify external connectivity and DNS resolution:

/ # ping google.com
PING google.com (172.217.21.174): 56 data bytes
64 bytes from 172.217.21.174: seq=0 ttl=114 time=7.749 ms
64 bytes from 172.217.21.174: seq=1 ttl=114 time=7.846 ms
64 bytes from 172.217.21.174: seq=2 ttl=114 time=7.860 ms
64 bytes from 172.217.21.174: seq=3 ttl=114 time=7.772 ms
64 bytes from 172.217.21.174: seq=4 ttl=114 time=7.790 ms
^C
--- google.com ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 7.749/7.803/7.860 ms
/ # nslookup google.com
Server:		10.152.183.10
Address:	10.152.183.10:53

Non-authoritative answer:
Name:	google.com
Address: 173.194.73.100
Name:	google.com
Address: 173.194.73.101
Name:	google.com
Address: 173.194.73.139
Name:	google.com
Address: 173.194.73.113
Name:	google.com
Address: 173.194.73.102
Name:	google.com
Address: 173.194.73.138

Non-authoritative answer:
Name:	google.com
Address: 2a00:1450:400f:804::200e

If packets are not transmitted or there is packet loss, one possible cause is the MicroK8s dns addon (CoreDNS). We can try disabling the dns addon and re-enabling it, then check whether the communication problem is fixed.

Disabling dns only needs to be done once, on any one of the cluster nodes, for example on hono-api-prod-01:

microk8s disable dns

Then re-enable it by running:

microk8s enable dns
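After re-enabling dns, we can verify that CoreDNS is back up before re-testing from the debug pods (in MicroK8s, CoreDNS runs as the coredns deployment in kube-system, labelled k8s-app=kube-dns):

```shell
# Wait for the CoreDNS deployment to become available again
microk8s kubectl -n kube-system rollout status deployment/coredns

# Check the CoreDNS pods
microk8s kubectl -n kube-system get pods -l k8s-app=kube-dns
```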

Note: afterwards, make sure the cluster still has at least the following addons enabled:

    cert-manager         # (core) Cloud native certificate management
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    ingress              # (core) Ingress controller for external access
    rbac                 # (core) Role-Based Access Control for authorisation

If the problem persists, it might be caused by a conflict between the Docker Swarm network (docker_gwbridge) and MicroK8s' Calico networking. In that case, we need to leave the Docker Swarm and remove the docker_gwbridge network.
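A sketch of leaving the swarm and removing the bridge. Note that this is disruptive to any Swarm services running on the host, so only do it when the host no longer needs Swarm:

```shell
# On each affected host: leave the Docker Swarm
# (--force is needed if this node is a Swarm manager)
docker swarm leave --force

# Remove the Swarm bridge network that conflicts with Calico
docker network rm docker_gwbridge

# Verify that the docker_gwbridge interface is gone
ip -c -br link | grep docker_gwbridge || echo "docker_gwbridge removed"
```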

Note: to check the network interfaces of the host:

> ip -c -br link
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0             UP             fa:16:3e:11:2b:5c <BROADCAST,MULTICAST,UP,LOWER_UP>
cali4b758e717db@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali1e53915acba@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali38f73f8377a@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali8a16fe0ab17@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali4dedbda4352@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali00024db6c64@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
vxlan.calico     UNKNOWN        66:4e:a0:05:ff:74 <BROADCAST,MULTICAST,UP,LOWER_UP>
cali82047bbedb0@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali717ecdeb0c4@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali58b75994b0b@if3 UP             ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>

vxlan.calico is a VXLAN network device used by Calico for overlay networking across multiple nodes. VXLAN is a network overlay technology that allows for the creation of a Layer 2 network on top of a Layer 3 network. It is commonly used in cloud computing environments to enable large-scale multi-tenant environments.

The docker_gwbridge network used by Docker can therefore conflict with the vxlan.calico network used by MicroK8s.

Rancher

Rancher might have problems communicating between nodes. Try scaling the replicas down to 1 to see if Rancher runs successfully:

kubectl scale deployment/rancher -n cattle-system --replicas=1

Note: check the Rancher pods by running:

kubectl describe pod -l app=rancher -n cattle-system

If it runs successfully, go to the Rancher UI and scale back up to 2 and then 3 replicas, checking the logs as you go:

(Screenshot of the Rancher UI scaling the deployment, 2024-06-21.)