Control plane API stops responding #5072
Comments
It might be helpful to access console logs or video output of the VM once it is in this "hanging" state. Talos can also stream the logs at least to the moment it hangs via My only guess is that VM doesn't have enough resources to run the |
I've doubled the resources for the test VM from the ones stated in the docs to 8GB memory and 20GB disk space. After a reboot I ran a health check every 10 seconds. 13 succeeded; the 14th got stuck at 11:55:10 with:
The Output from |
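The repeated health check described in the comment above could be scripted roughly as follows. This is only a sketch: `probe` and `watch` are hypothetical helper names, the 60-second timeout is an assumed threshold for calling a run "hung", and the node address in the usage note is a placeholder. Only the 10-second interval comes from the comment.

```python
import subprocess
import time


def probe(cmd, timeout_s):
    """Run cmd once and classify the result as 'ok', 'failed', or 'hung'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def watch(cmd, interval_s=10.0, timeout_s=60.0, max_runs=100):
    """Repeat probe() until a run hangs; return that attempt number, or None."""
    for attempt in range(1, max_runs + 1):
        status = probe(cmd, timeout_s)
        print(f"{time.strftime('%H:%M:%S')} attempt {attempt}: {status}")
        if status == "hung":
            return attempt
        time.sleep(interval_s)
    return None
```

Calling `watch(["talosctl", "health", "--nodes", "10.0.0.10"])` (with the real node address in place of the placeholder) would log a timestamp for every attempt, which makes it easier to line up the first hanging call with console or serial logs.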
I don't see anything in the logs which might point towards the problem. My only guess is that the CNI messes up networking in some way? Or some other privileged workload? I don't see any problem from the Talos side right now.
It looks like this depends on the number of connections. The VM can run unused overnight and still accept connections the next day. But a |
We don't have any specific way to do more debugging. One way might be to schedule a privileged pod on the node via Kubernetes and try regular Linux troubleshooting to see if there's any sign of resource exhaustion. I'm not sure where to even look right now. In terms of resource usage
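As one possible starting point for that kind of troubleshooting, the sketch below lists the processes holding the most open file descriptors by reading `/proc`. It assumes a privileged pod that shares the host PID namespace and has Python available; `fd_counts` and `top_fd_users` are hypothetical helper names, and a high descriptor count is only a hint, not proof that this is what makes apid hang.

```python
import os


def fd_counts(proc_root="/proc"):
    """Map each PID under proc_root to its number of open file descriptors."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            counts[int(entry)] = len(os.listdir(os.path.join(proc_root, entry, "fd")))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
    return counts


def top_fd_users(proc_root="/proc", limit=5):
    """Return up to `limit` (pid, fd_count) pairs, highest count first."""
    counts = fd_counts(proc_root)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```

Running `top_fd_users()` once right after a reboot and again once talosctl hangs would show whether any process (apid or otherwise) accumulates descriptors between the two states.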
I have a similar issue, described in #8049: exactly the same conditions, and the Talos API also hangs. Overnight everything was fine, then the next day I ran
We also experience the same thing in VMware. Using We tried E1000, E1000E and VMXNET3 NIC types to rule out issues there. We managed to stop the apid process/container, and when it restarts the problem is "reset" the same way a reboot resets it... until the next day. Any other suggestions?
@RvRuttenFS Have you tried older Talos versions (1.6.0, etc.)? Also try OVA 1.6.7 (released two days ago).
Thanks for your suggestion. Yes, we tried OVA 1.6.0, 1.6.5, 1.6.6 and 1.6.7. We think apid is somehow partially crashing (or maybe some other component behind it).
If Quick check for |
So you have tried everything I did :/ Which version of VMware are you running? Have you tried installing from the ISO rather than the OVA? (Be sure to change the disk settings so that Talos can see the disks.)
@smira
So no info/response there. Looking at the log ( Any suggestions where or what kind of other logs may help in this? |
I'm confused - how can you access apid logs if you can't access the API? Look at the console/serial logs. If there's nothing there, I'd assume it's not Talos. Talos does a health check on apid, so if it stops responding, it should print to the console.
We asked ourselves the same question; that's why we said "partially" on purpose. Is there anything else we can check or confirm to make sense of this weird behavior? After killing the apid process (through a privileged debug pod), a new apid process starts and that does reset something, as we can then run talosctl again. Same effect as a reboot of the node. I also sometimes noticed 502 Bad Gateway errors when using talosctl. Lastly, I now see I forgot to mention we use Omni SaaS - if that makes any difference.
If using Omni, you have console logs available in the machine view. And it'd be better to create an issue in the Omni repo. Next release of Omni should have
Seems we have found something. As I hijacked this issue for a bit, it seems fair to update on what caused it for us. In our cluster patch yaml file in Omni we had Not sure if @Filip7656 has used this debug key too, but if you did - now you know this is something that should not be used. Thanks everyone!
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
Anecdotal, but I believe I ran into something similar to this after setting If debug logs are crashing apid, it does make it hard to understand the issue, especially when I'm enabling debug logs as a Talos novice. I'm testing on an Ubuntu 22.04 host, with libvirt/KVM/QEMU and a 4-core, 8GB RAM, 40GB disk guest running Talos 1.7.6.
@smira Thanks for expanding; I wasn't aware of that extra note. This seems separate from this apid crash, but from the perspective of a new Talos user, the extra note is hard to find or could be written differently if debug mode is discouraged. I've been primarily working with the default |
Yeah, a proper move for us is to hide this knob completely, I'll look into that, certainly not something that people should use. |
Bug Report
Description
I've set up a control plane according to the docs for VMware. The only thing changed in controlplane.yaml is the static network configuration. The node boots up correctly, but after some time it stops responding: talosctl just hangs for any command without any output. I can still ping the node, and kubectl get pods works and shows the default pods as running. The VM shows no errors in its console. Logs sent to a TCP endpoint also do not show any errors. After a reboot it seems to work again, but after running talosctl health a few times the node stops responding again within a few minutes. Even if left alone, this seems to happen after a day or so.
Logs
Not sure what I should append here. A rebooted node doesn't seem to show logs from the previous run, and once this state is reached I cannot get any logs.
Environment