Troubleshooting for Microk8s Kubernetes cluster on NREC
After the Microk8s Kubernetes cluster has been created, if there are errors or problems reaching the cluster or the application, start by checking each NREC instance that hosts a cluster node.
On each cluster node, we can run:
[rocky@hono-api-prod-01 ~]$ microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 158.37.65.7:19001 158.37.65.60:19001 158.37.65.111:19001
datastore standby nodes: none
addons:
enabled:
cert-manager # (core) Cloud native certificate management
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm # (core) Helm - the package manager for Kubernetes
helm3 # (core) Helm 3 - the package manager for Kubernetes
ingress # (core) Ingress controller for external access
metrics-server # (core) K8s Metrics Server for API access to service metrics
rbac # (core) Role-Based Access Control for authorisation
disabled:
cis-hardening # (core) Apply CIS K8s hardening
community # (core) The community addons repository
dashboard # (core) The Kubernetes dashboard
gpu # (core) Automatic enablement of Nvidia CUDA
host-access # (core) Allow Pods connecting to Host services smoothly
hostpath-storage # (core) Storage class; allocates storage from host directory
kube-ovn # (core) An advanced network fabric for Kubernetes
mayastor # (core) OpenEBS MayaStor
metallb # (core) Loadbalancer for your Kubernetes cluster
minio # (core) MinIO object storage
observability # (core) A lightweight observability stack for logs, traces and metrics
prometheus # (core) Prometheus operator for monitoring and logging
registry # (core) Private image registry exposed on localhost:32000
rook-ceph # (core) Distributed Ceph storage using Rook
storage # (core) Alias to hostpath-storage add-on, deprecated
and
> microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
Building the report tarball
Report tarball is at /var/snap/microk8s/6750/inspection-report-20240523_161839.tar.gz
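The report tarball path printed on the last line changes every run (the snap revision directory and the timestamp both vary). A hedged sketch of unpacking the most recent report for offline review; the glob pattern and target directory are assumptions, not part of the tool:

```shell
# Find and unpack the newest MicroK8s inspection report.
# The snap revision directory (6750 above) and the timestamp differ per run,
# so glob for the newest match instead of hard-coding the path.
report=$(ls -t /var/snap/microk8s/*/inspection-report-*.tar.gz 2>/dev/null | head -n 1)
if [ -n "$report" ]; then
    mkdir -p ~/inspection
    tar -xzf "$report" -C ~/inspection
    ls ~/inspection
else
    echo "no inspection report found; run 'microk8s inspect' first"
fi
```

The tarball contains the service arguments, logs, and system information listed above, which is useful when filing an issue or comparing a healthy node against a broken one.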
If a pod is not behaving as expected, the first port of call should be the logs.
[rocky@hono-api-prod-01 ~]$ kubectl get deploy -n hono-api-prod -o wide
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
hono-api 3/3 3 3 7d18h hono-api ghcr.io/uib-ub/uib-ub/uib-ub-monorepo-api:latest app=hono-api
[rocky@hono-api-prod-01 ~]$ kubectl get pod -n hono-api-prod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hono-api-6fd895c8bc-5l2h9 1/1 Running 9 7d18h 10.1.56.98 hono-api-prod-03 <none> <none>
hono-api-6fd895c8bc-92l8p 1/1 Running 10 (9h ago) 7d18h 10.1.27.56 hono-api-prod-02 <none> <none>
hono-api-6fd895c8bc-rfh9r 1/1 Running 8 (9h ago) 7d18h 10.1.53.175 hono-api-prod-01 <none> <none>
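Restart counts like the 9 and 10 above are worth a closer look. A sketch of pulling the events and the logs from before the last crash; the pod name is taken from the listing above, and the snippet is guarded so it is a no-op on machines without kubectl:

```shell
# Events (OOMKilled, failed probes, image pull errors) show up at the end of
# `describe`; `--previous` prints logs from before the container's last restart.
# Guarded so this is a no-op on machines without kubectl.
if command -v kubectl >/dev/null 2>&1; then
    kubectl describe pod -n hono-api-prod hono-api-6fd895c8bc-92l8p | tail -n 20
    kubectl logs --previous -n hono-api-prod hono-api-6fd895c8bc-92l8p
else
    echo "kubectl not found; run this on a cluster node"
fi
```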
Then check the logs of the app, across all pods matching its label:
> kubectl logs -f -n hono-api-prod -l app=hono-api
GET /legacy/groups/:source
GET /legacy/groups/:source/:id
GET /admin/ingest
GET /admin/ingest/manifests
GET /admin/ingest/legacy/ska
GET /admin/ingest/legacy/wab
GET /ns/es/context.json
GET /ns/ubbont/context.json
GET /ns/shacl/context.json
GET /openapi
GET /legacy/groups/:source
GET /legacy/groups/:source/:id
GET /admin/ingest
GET /admin/ingest/manifests
GET /admin/ingest/legacy/ska
GET /admin/ingest/legacy/wab
GET /ns/es/context.json
GET /ns/ubbont/context.json
GET /ns/shacl/context.json
GET /openapi
GET /legacy/groups/:source
GET /legacy/groups/:source/:id
GET /admin/ingest
GET /admin/ingest/manifests
GET /admin/ingest/legacy/ska
GET /admin/ingest/legacy/wab
GET /ns/es/context.json
GET /ns/ubbont/context.json
GET /ns/shacl/context.json
GET /openapi
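When tailing by label like this, the output of the three replicas is interleaved with no indication of which pod a line came from. A sketch of narrowing that down (the flag values are arbitrary examples; guarded so it is a no-op without kubectl):

```shell
# --prefix prepends the pod/container name to every log line;
# --tail limits output to the most recent lines per pod.
# Guarded so this is a no-op on machines without kubectl.
if command -v kubectl >/dev/null 2>&1; then
    kubectl logs -n hono-api-prod -l app=hono-api --prefix --tail=20
else
    echo "kubectl not found; run this on a cluster node"
fi
```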
Or check the logs of a single pod:
kubectl logs -f -n hono-api-prod hono-api-6fd895c8bc-5l2h9
GET /
GET /items
GET /items/:id
GET /items/:id/manifest
GET /items/:id/manifest.json
GET /reference
GET /lookup/:id
PUT /admin/es/update-templates
GET /legacy/wab/list
GET /legacy/wab
GET /legacy/items/:source/count
GET /legacy/items/:source
GET /legacy/items/:source/:id
GET /legacy/items/:source/:id/manifest.json
GET /legacy/people/:source/count
GET /legacy/people/:source
GET /legacy/people/:source/:id
GET /legacy/groups/:source/count
GET /legacy/groups/:source
GET /legacy/groups/:source/:id
GET /admin/ingest
GET /admin/ingest/manifests
GET /admin/ingest/legacy/ska
GET /admin/ingest/legacy/wab
GET /ns/es/context.json
GET /ns/ubbont/context.json
GET /ns/shacl/context.json
GET /openapi
Check the deployments in all namespaces:
[rocky@hono-api-prod-01 ~]$ kubectl get deployment --all-namespaces
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
cattle-fleet-local-system fleet-agent 1/1 1 1 77d
cattle-fleet-system fleet-agent 1/1 1 1 75d
cattle-fleet-system fleet-controller 1/1 1 1 90d
cattle-fleet-system gitjob 1/1 1 1 90d
cattle-system cattle-cluster-agent 2/2 2 2 75d
cattle-system rancher-webhook 1/1 1 1 75d
cert-manager cert-manager 1/1 1 1 90d
cert-manager cert-manager-cainjector 1/1 1 1 90d
cert-manager cert-manager-webhook 1/1 1 1 90d
default github-deploy-hono 1/1 1 1 23d
hono-api-prod hono-api 3/3 3 3 7d18h
hono-api-test hono-api 3/3 3 3 7d17h
kube-system calico-kube-controllers 1/1 1 1 90d
kube-system coredns 1/1 1 1 90d
kube-system metrics-server 1/1 1 1 90d
monitoring kube-state-metrics 3/3 3 3 89d
For example, to restart the deployment hono-api, run:
kubectl rollout restart deployment/hono-api -n hono-api-prod
Again taking hono-api as an example, scale the deployment down to 0 replicas (stopping it):
kubectl scale deployment/hono-api --replicas=0 -n hono-api-prod
or scale it back up to 3 replicas:
kubectl scale deployment/hono-api --replicas=3 -n hono-api-prod
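After a restart or scale change, it can help to wait for the rollout to settle before re-testing the application. A sketch using kubectl rollout status; the timeout value is an arbitrary example, and the snippet is guarded so it is a no-op without kubectl:

```shell
if command -v kubectl >/dev/null 2>&1; then
    # Blocks until all replicas are updated and ready, or fails after the timeout.
    kubectl rollout status deployment/hono-api -n hono-api-prod --timeout=120s
    kubectl get pod -n hono-api-prod -l app=hono-api
else
    echo "kubectl not found; run this on a cluster node"
fi
```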
To check the resource usage of nodes and pods (this relies on the metrics-server addon, which is enabled above):
> kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
hono-api-prod-01 276m 13% 4714Mi 62%
hono-api-prod-02 252m 12% 4302Mi 56%
hono-api-prod-03 311m 15% 4709Mi 62%
> kubectl top pod -n hono-api-prod
NAME CPU(cores) MEMORY(bytes)
hono-api-54b78fdc5d-4nj55 1m 182Mi
hono-api-54b78fdc5d-hl9nh 1m 181Mi
hono-api-54b78fdc5d-s7lgq 1m 174Mi
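For a quick check against a threshold, the MEMORY% column can be parsed directly. A minimal sketch using the sample figures above; in practice you would pipe kubectl top node straight into awk, and the 60% threshold is an arbitrary example:

```shell
# Sample copied from the `kubectl top node` output above; replace the here-doc
# with a real `kubectl top node` pipe on a cluster node.
sample() {
cat <<'EOF'
NAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
hono-api-prod-01   276m         13%    4714Mi          62%
hono-api-prod-02   252m         12%    4302Mi          56%
hono-api-prod-03   311m         15%    4709Mi          62%
EOF
}

# Print any node whose memory usage exceeds 60% (strip the % sign, compare numerically).
sample | awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > 60) print $1, $5 "%" }'
# prints:
# hono-api-prod-01 62%
# hono-api-prod-03 62%
```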
Sometimes the problem is that pods (or cluster nodes) cannot communicate with each other. To test communication, we can create a debug pod on each NREC instance, e.g.:
(assume the cluster has 3 NREC instances with hostnames hono-api-prod-01, hono-api-prod-02, and hono-api-prod-03, respectively)
on NREC instance node one:
kubectl run -i --tty --rm debug-n1 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-01"}}}' -- sh
on NREC instance node two:
kubectl run -i --tty --rm debug-n2 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-02"}}}' -- sh
on NREC instance node three:
kubectl run -i --tty --rm debug-n3 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-03"}}}' -- sh
Then, we get the IPs of these debug pods:
> kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
debug-n1 1/1 Running 0 3m39s 10.1.167.31 hono-api-prod-01 <none> <none>
debug-n2 1/1 Running 0 98s 10.1.76.55 hono-api-prod-02 <none> <none>
debug-n3 1/1 Running 0 76s 10.1.42.16 hono-api-prod-03 <none> <none>
Then, from within each debug pod, we can ping the other pods' IPs to check node-to-node communication, and ping an external site such as google.com to check outbound connectivity.
E.g. from pod debug-n3, we ping 10.1.167.31 to test the path to debug-n1:
> kubectl run -i --tty --rm debug-n3 --image=busybox --overrides='{"apiVersion":"v1", "spec":{"nodeSelector":{"kubernetes.io/hostname":"hono-api-prod-03"}}}' -- sh
If you don't see a command prompt, try pressing enter.
/ #
/ #
/ #
/ # ping 10.1.167.31
PING 10.1.167.31 (10.1.167.31): 56 data bytes
64 bytes from 10.1.167.31: seq=0 ttl=62 time=0.545 ms
64 bytes from 10.1.167.31: seq=1 ttl=62 time=0.396 ms
64 bytes from 10.1.167.31: seq=2 ttl=62 time=0.400 ms
64 bytes from 10.1.167.31: seq=3 ttl=62 time=0.396 ms
64 bytes from 10.1.167.31: seq=4 ttl=62 time=0.351 ms
64 bytes from 10.1.167.31: seq=5 ttl=62 time=0.331 ms
^C
--- 10.1.167.31 ping statistics ---
6 packets transmitted, 6 packets received, 0% packet loss
round-trip min/avg/max = 0.331/0.403/0.545 ms
We can also ping google.com to verify that packets reach the outside world and that DNS resolution works:
/ # ping google.com
PING google.com (172.217.21.174): 56 data bytes
64 bytes from 172.217.21.174: seq=0 ttl=114 time=7.749 ms
64 bytes from 172.217.21.174: seq=1 ttl=114 time=7.846 ms
64 bytes from 172.217.21.174: seq=2 ttl=114 time=7.860 ms
64 bytes from 172.217.21.174: seq=3 ttl=114 time=7.772 ms
64 bytes from 172.217.21.174: seq=4 ttl=114 time=7.790 ms
^C
--- google.com ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 7.749/7.803/7.860 ms
/ # nslookup google.com
Server: 10.152.183.10
Address: 10.152.183.10:53
Non-authoritative answer:
Name: google.com
Address: 173.194.73.100
Name: google.com
Address: 173.194.73.101
Name: google.com
Address: 173.194.73.139
Name: google.com
Address: 173.194.73.113
Name: google.com
Address: 173.194.73.102
Name: google.com
Address: 173.194.73.138
Non-authoritative answer:
Name: google.com
Address: 2a00:1450:400f:804::200e
If packets are lost or cannot be transmitted at all, one possible cause is the Microk8s dns addon (CoreDNS). Try disabling and then re-enabling the dns addon, and check whether communication recovers.
Disabling dns only needs to be done once, on any one of the cluster nodes, for example on hono-api-prod-01:
microk8s disable dns
then re-enable it by running:
microk8s enable dns
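After re-enabling, it is worth confirming that CoreDNS came back and resolves cluster names. A hedged sketch: the k8s-app=kube-dns label is what the dns addon's CoreDNS deployment normally carries, and the snippet is guarded so it is a no-op without kubectl:

```shell
if command -v kubectl >/dev/null 2>&1; then
    # The CoreDNS deployment created by the dns addon lives in kube-system.
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    # One-shot lookup from a throwaway pod against the cluster resolver.
    kubectl run -i --rm dns-check --image=busybox --restart=Never \
        -- nslookup kubernetes.default.svc.cluster.local
else
    echo "kubectl not found; run this on a cluster node"
fi
```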
Note: afterwards, make sure the cluster still has at least the following addons enabled:
cert-manager # (core) Cloud native certificate management
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm # (core) Helm - the package manager for Kubernetes
helm3 # (core) Helm 3 - the package manager for Kubernetes
ingress # (core) Ingress controller for external access
rbac # (core) Role-Based Access Control for authorisation
If the problem persists, it might be that the Docker Swarm network (docker_gwbridge) conflicts with Microk8s' Calico networking. In that case, disable Docker Swarm and remove the docker_gwbridge network.
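A hedged sketch of backing Docker Swarm out on an affected node. This is disruptive if the host still runs Swarm services, so inspect the networks first; the snippet is guarded so it is a no-op on hosts without docker:

```shell
if command -v docker >/dev/null 2>&1; then
    docker network ls                  # check whether docker_gwbridge exists
    docker swarm leave --force        # take this node out of the swarm
    docker network rm docker_gwbridge # then remove the conflicting bridge
else
    echo "docker not found on this host"
fi
```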
Note: you can check the host's network interfaces with:
> ip -c -br link
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0 UP fa:16:3e:11:2b:5c <BROADCAST,MULTICAST,UP,LOWER_UP>
cali4b758e717db@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali1e53915acba@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali38f73f8377a@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali8a16fe0ab17@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali4dedbda4352@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali00024db6c64@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
vxlan.calico UNKNOWN 66:4e:a0:05:ff:74 <BROADCAST,MULTICAST,UP,LOWER_UP>
cali82047bbedb0@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali717ecdeb0c4@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
cali58b75994b0b@if3 UP ee:ee:ee:ee:ee:ee <BROADCAST,MULTICAST,UP,LOWER_UP>
vxlan.calico is a VXLAN network device used by Calico for overlay networking across multiple nodes. VXLAN is a network overlay technology that creates a Layer 2 network on top of a Layer 3 network; it is commonly used in cloud computing environments to support large-scale multi-tenant networking.
So the docker_gwbridge network used by Docker can conflict with the vxlan.calico network used by Microk8s.
Rancher itself might have problems communicating between nodes. Try scaling its replicas down to 1 to see whether Rancher runs successfully:
kubectl scale deployment/rancher -n cattle-system --replicas=1
Note: check the Rancher pods by running:
kubectl describe pod -l app=rancher -n cattle-system
If it runs successfully, scale back up to 2 and then 3 replicas in the Rancher UI, checking the logs as you go.