This is a kernel or hardware failure. Have you tried a different brand of USB SSD controller or drive? I have a Pi 4B that has been running in this configuration for several years.
I have a cluster:
All of these are 8GB Raspberry Pi 5s running Raspbian OS Lite 64-bit, except k3spi-ai-worker-1, which is a spec'd-out PC running Ubuntu Server.
k3spi-master-1 boots from an external SSD connected via USB. There's no microSD card attached. It's just an external SSD.
These are the pods that are running on the cluster as a whole:
I added a taint to `k3spi-master-1`:
I made sure that no pods were running directly on `k3spi-master-1` apart from essential ones:
For a while, everything ran smoothly. Then I installed Longhorn and noticed some issues. I figured it was because etcd and Longhorn were contending for the disk. I added a taint so that no Longhorn pods can be scheduled on the master node. I also made sure that only `k3spi-ai-worker-1` handles all Longhorn-related tasks, because it's where all the external drives are hosted and it's a much beefier machine than the rest.
However, lately, things have been interesting. Every day (overnight or after X hours), the master node goes down. This makes the cluster unavailable on Rancher (hosted on a separate cluster) and causes kubectl to fail to reach the cluster. In addition, the Raspberry Pi goes into read-only mode and begins to freeze. Nothing works anymore (SSH, running commands, etc.).
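One way to confirm the read-only state from a shell that still responds is to look at the mount options of the root filesystem. The following is only a minimal sketch, assuming the standard Linux `/proc/mounts` layout (device, mount point, filesystem type, options); the function names are hypothetical, and it can only help if it runs before the node freezes completely.

```python
def root_mount_options(mounts_text: str) -> list[str]:
    """Return the mount options of the last entry mounted at "/"."""
    options: list[str] = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, ...
        if len(fields) >= 4 and fields[1] == "/":
            # Later entries override earlier ones, so keep the last match.
            options = fields[3].split(",")
    return options


def root_is_read_only(mounts_text: str) -> bool:
    """True if the root filesystem is currently mounted with the "ro" option."""
    return "ro" in root_mount_options(mounts_text)


def check_live_system(path: str = "/proc/mounts") -> bool:
    """Read the live mount table (Linux only) and report the root state."""
    with open(path) as f:
        return root_is_read_only(f.read())
```

Calling `check_live_system()` periodically (for example from a cron job that writes to a remote log) could show how long before the freeze the filesystem flips to read-only. Note that an option such as `errors=remount-ro` only describes the policy on errors; it does not mean the filesystem is read-only at that moment.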
These are the only logs shown on the master node's output when I connect the Pi to a monitor:
I've tried repairing that partition / drive. The repair usually reports success and everything works fine until the error shows up again. After I restart the Pi, the earlier error has no real effect on its operation until the Pi goes into lockout mode again.
The master node appears and feels locked out. Rebooting it from the command line is not possible, and when I try to log into the Pi itself (not over SSH), it accepts my username but hangs indefinitely when I press Enter.
The only way to get the cluster / master node back online is to disconnect the USB-C power cable and reconnect it to power the master node back on.
This fixes the issue for X hours until it happens again.
This is way outside my expertise. Has anyone else encountered this, or can anyone point me in a direction for fixing it?