
Control plane API stops responding #5072

Open
gix opened this issue Mar 3, 2022 · 21 comments

Comments

gix commented Mar 3, 2022

Bug Report

Description

I've set up a control plane according to the docs for vmware. The only thing changed in controlplane.yaml is static network configuration. The node boots up correctly, but after some time stops responding.

talosctl just hangs for any command, without any output. I can still ping the node, and kubectl get pods works and shows the default pods as running. The VM shows no errors in its console. Logs sent to a TCP endpoint also show no errors. After a reboot it seems to work again, but after running talosctl health a few times the node stops responding again within a few minutes. Even if left alone, this seems to happen after a day or so.
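For context, a static network configuration in controlplane.yaml usually takes roughly this shape in the Talos machine config (a hedged sketch: the interface name, gateway, and nameserver here are illustrative, not the reporter's actual values; only the node IP appears later in this thread):

```yaml
machine:
  network:
    interfaces:
      - interface: eth0          # illustrative; depends on the VM's NIC
        addresses:
          - 10.1.0.191/24        # node IP mentioned later in this thread
        routes:
          - network: 0.0.0.0/0
            gateway: 10.1.0.1    # illustrative default gateway
    nameservers:
      - 10.1.0.1                 # illustrative
```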

Logs

Not sure what I should attach here. A rebooted node doesn't seem to show logs from the previous run, and once this state is reached I cannot get any logs.

Environment

  • Talos version: v0.14.1
  • Kubernetes version: v1.23.0 (Client), v1.23.1 (Server)
  • Platform: VMware ESXi 6.7

smira commented Mar 3, 2022

It might be helpful to access console logs or video output of the VM once it is in this "hanging" state.

Talos can also stream the logs at least to the moment it hangs via talosctl dmesg -f.

My only guess is that VM doesn't have enough resources to run the apid process, but this is a wild guess.


gix commented Mar 7, 2022

I've doubled the resources for the test VM from the ones stated in the docs, to 8 GB memory and 20 GB disk space. After a reboot I ran a health check every 10 seconds. 13 succeeded; the 14th got stuck at 11:55:10 with:

discovered nodes: control plane: ["10.1.0.191"], worker: []
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The dmesg -f continues afterwards, showing only NTP messages. Every newly submitted talosctl command hangs.

Output from dmesg -f: dmesg-f.log
Logs received by a TCP collector: received.log


smira commented Mar 9, 2022

I don't see anything in the logs which might point towards the problem. My only guess is that the CNI messes up networking in some way? Or some other privileged workload?

I don't see any problem on the Talos side right now.


gix commented Mar 10, 2022

It looks like this depends on the number of connections. The VM can run unused overnight and still accept connections the next day, but a talosctl support after a reboot will hang midway through. Is there any way to get more debug output on the console (in addition to debug: true)? A debug build of Talos?


smira commented Mar 10, 2022

We don't have any specific way to do more debugging. One option might be to schedule a privileged pod on the node via Kubernetes and try regular Linux troubleshooting, looking for signs of resource exhaustion. I'm not sure where to even look right now.

In terms of resource usage, talosctl dashboard might help. As for the API connections, talosctl logs -f apid and talosctl logs -f machined might help.

@Filip7656

I have a similar issue, described here in #8049: exactly the same conditions, and the Talos API also hangs. Overnight everything was fine, then the next day I ran talosctl health and it hung on the disk check. After that I couldn't run any other talosctl commands.
I think it has something to do with VMware networking.

@RvRuttenFS

We also experience the same thing on VMware. Running talosctl services on a watch will also end up in the same state after a couple of minutes. We looked at #8049 too and have the same log output (or lack of it). We used OVA 1.6.6.

We tried the E1000, E1000E and VMXNET3 NIC types to rule out issues there.

We managed to stop the apid process/container, and when it restarts the problem is "reset" the same way a reboot resets it... until the next day / the next talosctl services or talosctl health.

Any other suggestions?

@Filip7656

@RvRuttenFS Have you tried older Talos versions, e.g. 1.6.0? Also try OVA 1.6.7 (released two days ago).
This issue took me two weeks to resolve, and my solution was upgrading to a newer version with a newer Linux kernel.

@RvRuttenFS

Thanks for your suggestion. Yes, we tried OVA 1.6.0, 1.6.5, 1.6.6 and 1.6.7.
We also tried many settings: static IP and DHCP, flannel and Cilium CNI, NTP on and off.

We think apid is somehow partially crashing (or maybe some other component behind it).
https://www.talos.dev/v1.6/learn-more/components/#components


smira commented Mar 22, 2024

If apid is crashing, you will see it in the logs.

Quick check for apid from outside is to do talosctl --endpoint IP --nodes IP version, this API should always respond as long as apid is still listening.
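As a complement to the talosctl version check above, a raw TCP probe can separate "nothing is listening" from "listening but not answering". A minimal sketch (assuming apid's default port 50000; this only tests the TCP connect, not an actual gRPC call, which matches the "hangs mid-request" symptom in this thread):

```python
import socket


def apid_listening(host: str, port: int = 50000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to apid's port can be opened.

    50000 is the default Talos apid port. A successful connect only
    proves a listener is up, not that gRPC requests will be answered.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True while talosctl still hangs, the listener is alive but requests are stalling somewhere behind it.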

@Filip7656

So you have tried all the stuff I did. :/ What version of VMware are you running? Have you tried installing from the ISO instead of the OVA? (Be sure to change the disk settings so the disks are seen by Talos.)

RvRuttenFS commented Mar 22, 2024

@smira
Ran the talosctl command and got back:

Client:
	Tag:         v1.6.2
	SHA:         26eee755
	Built:
	Go version:  go1.21.6 X:loopvar
	OS/Arch:     darwin/arm64
Server:

So no info/response there.

Looking at the log (talosctl logs apid --tail -1) shows only one older entry.

Any suggestions where or what kind of other logs may help in this?


smira commented Mar 22, 2024

I'm confused: how can you access the apid logs if you can't access the API?

Look at the console/serial logs. If there's nothing there, I'd assume it's not Talos.

Talos does a health check on apid, so if it stops responding, it should print to the console.

@RvRuttenFS

We asked ourselves the same question; that's why we said "partially" on purpose. Is there anything else we can check or confirm to make sense of this weird behavior?

After killing the apid process (through a privileged debug pod), a new apid process starts, and that does reset something, as we can then run talosctl again. Same effect as a reboot of the node.
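For anyone reproducing this, a privileged debug pod like the one mentioned can be sketched roughly as follows (the pod name, namespace, and image are placeholders; hostPID is what makes apid's PID visible and signalable from inside the pod):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-shell      # placeholder name
  namespace: kube-system # placeholder; must be allowed to run privileged pods
spec:
  hostNetwork: true
  hostPID: true          # share the host PID namespace to see apid
  containers:
    - name: shell
      image: alpine:3.19 # any small image with a shell works
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true
```

You can then kubectl exec into the pod and signal the apid process from the shared PID namespace (e.g. with busybox pkill).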

I also noticed 502 Bad Gateway errors sometimes when using talosctl.

Lastly, I now see I forgot to mention we use Omni SaaS - if that makes any difference.


smira commented Mar 22, 2024

If using Omni, you have console logs available in the machine view.

And it'd be better to create an issue in the Omni repo. The next release of Omni should have an omnictl support command to generate a great support bundle.

@RvRuttenFS

Seems we have found something. As I hijacked this issue for a bit, it seems fair to share what caused it for us.

In our cluster patch YAML file in Omni we had debug: true set. But instead of giving us more logging, it stopped showing apid's logs at all (this is actually a bug!). After some time, the logs that were never delivered appeared to fill up some buffer, and that froze apid, but only on VMware/ESXi. After we removed the debug key from the cluster patch, no more strange behavior was observed and the clusters worked again.
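The "fill up some buffer" theory matches a generic Unix failure mode: a process writing logs to a pipe or socket that nobody drains can only proceed until the kernel buffer is full, after which a blocking writer stalls. A minimal illustration of that mechanism (plain Python, not Talos code; the kernel pipe buffer is typically 64 KiB on Linux):

```python
import os

# Create a pipe and never read from it, simulating a log sink
# that stops consuming. Use a non-blocking writer so this script
# can observe the full buffer instead of hanging like apid would.
r, w = os.pipe()
os.set_blocking(w, False)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    # Buffer is full: a blocking log writer would now be stuck here,
    # and anything waiting on that writer stalls with it.
    pass

print("pipe buffer filled after", written, "bytes")
os.close(r)
os.close(w)
```

This is only an analogy for the observed symptom, not a confirmed diagnosis of the apid freeze.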

Not sure if @Filip7656 used this debug key too, but if you did, now you know it's something that should not be used.
If not, I hope you will figure out what is causing it for you.

Thanks everyone!


This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 23, 2024
@askedrelic

Anecdotal, but I believe I ran into something similar after setting debug: true, and it was confusing to debug until this issue suggested disabling debug. What clued me in was that kubectl would still work, but Talos gRPC calls would time out. As far as I remember, the dashboard didn't show anything specifically wrong.

If debug logs are crashing apid, that makes the issue hard to understand, especially when enabling debug logs was my first step as a Talos novice.

I'm testing on a Ubuntu 22.04 host, with libvirt/KVM/QEMU and a 4 core, 8GB RAM, 40GB disk guest running Talos 1.7.6.

@github-actions github-actions bot removed the Stale label Sep 25, 2024

smira commented Sep 25, 2024

Please never ever use debug: true unless you have no access to the machine itself over the network, and you're a developer.

[screenshot: documentation note about the debug option]

@askedrelic

@smira Thanks for expanding; I wasn't aware of that extra note.

This seems separate from the apid crash itself, but from the perspective of a new Talos user, that extra note is hard to find, and it could be worded differently if debug mode is discouraged. I've been working primarily with the default talosctl gen config, and enabling debug mode to increase logging was my first step toward understanding more. In that config, it's only documented as: debug: false # Enable verbose logging to the console.


smira commented Sep 26, 2024

Yeah, the proper move for us is to hide this knob completely. I'll look into that; it's certainly not something people should use.
