
Control plane API stops responding #5072

Open
gix opened this issue Mar 3, 2022 · 21 comments

Comments

gix commented Mar 3, 2022

Bug Report

Description

I've set up a control plane according to the docs for vmware. The only thing changed in controlplane.yaml is static network configuration. The node boots up correctly, but after some time stops responding.

talosctl just hangs for any command, without any output. I can still ping the node, and kubectl get pods works and shows the default pods as running. The VM shows no errors in its console. Logs sent to a TCP endpoint also show no errors. After a reboot it seems to work again, but after running talosctl health a few times the node stops responding again within a few minutes. Even if left alone, this seems to happen after a day or so.
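For context, a static network configuration in controlplane.yaml usually takes roughly this shape in the Talos machine config (a hedged sketch: the interface name, gateway, and nameserver here are illustrative, not the reporter's actual values; only the node IP appears later in this thread):

```yaml
machine:
  network:
    interfaces:
      - interface: eth0          # illustrative; depends on the VM's NIC
        addresses:
          - 10.1.0.191/24        # node IP mentioned later in this thread
        routes:
          - network: 0.0.0.0/0
            gateway: 10.1.0.1    # illustrative default gateway
    nameservers:
      - 10.1.0.1                 # illustrative
```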

Logs

Not sure what I should attach here. A rebooted node doesn't seem to show logs from the previous run, and once this state is reached I cannot get any logs.

Environment

  • Talos version: v0.14.1
  • Kubernetes version: v1.23.0 (Client), v1.23.1 (Server)
  • Platform: VMware ESXi 6.7

smira commented Mar 3, 2022

It might be helpful to access console logs or video output of the VM once it is in this "hanging" state.

Talos can also stream the logs at least to the moment it hangs via talosctl dmesg -f.

My only guess is that VM doesn't have enough resources to run the apid process, but this is a wild guess.


gix commented Mar 7, 2022

I've doubled the resources for the test VM from the ones stated in the docs, to 8 GB memory and 20 GB disk space. After a reboot I ran a health check every 10 seconds. 13 succeeded; the 14th got stuck at 11:55:10 with:

discovered nodes: control plane: ["10.1.0.191"], worker: []
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The dmesg -f continues afterwards, showing only NTP messages. Every newly submitted talosctl command hangs.

Output from dmesg -f: dmesg-f.log
Logs received by a TCP collector: received.log


smira commented Mar 9, 2022

I don't see anything in the logs which might point towards the problem. My only guess is that the CNI messes up networking in some way? Or some other privileged workload?

I don't see any problem on the Talos side right now.


gix commented Mar 10, 2022

It looks like this depends on the number of connections. The VM can run unused overnight and still accept connections the next day, but a talosctl support after a reboot will hang midway through. Is there any way to get more debug output on the console (in addition to debug: true)? A debug build of Talos?


smira commented Mar 10, 2022

We don't have any specific way to do more debugging. One option might be to schedule a privileged pod on the node via Kubernetes and try regular Linux troubleshooting, looking for signs of resource exhaustion. I'm not sure where to even look right now.

In terms of resource usage, talosctl dashboard might help. As for the API connections, talosctl logs -f apid and talosctl logs -f machined might help.

@Filip7656

I have a similar issue, described here in #8049: exactly the same conditions, and the Talos API also hangs. Overnight everything was fine, then the next day I ran talosctl health and it hung on the disk check. After that I couldn't run any other talosctl commands.
I think it has something to do with VMware networking.

@RvRuttenFS

We also experience the same thing on VMware. Running talosctl services on a watch will also end up in the same state after a couple of minutes. We looked at #8049 too and have the same log output (or lack of it). We used OVA 1.6.6.

We tried the E1000, E1000E and VMXNET3 NIC types to rule out issues there.

We managed to stop the apid process/container, and when it restarts the problem is "reset" the same way a reboot resets it... until the next day / the next talosctl services or talosctl health.

Any other suggestions?

@Filip7656

@RvRuttenFS Have you tried older Talos versions, e.g. 1.6.0? Also try OVA 1.6.7 (released two days ago).
This issue took me two weeks to resolve, and my solution was upgrading to a newer version with a newer Linux kernel.

@RvRuttenFS

Thanks for your suggestion. Yes, we tried OVA 1.6.0, 1.6.5, 1.6.6 and 1.6.7.
We also tried many settings: static IP and DHCP, flannel and Cilium CNI, NTP on and off.

We think apid is somehow partially crashing (or maybe some other component behind it).
https://www.talos.dev/v1.6/learn-more/components/#components


smira commented Mar 22, 2024

If apid is crashing, you will see it in the logs.

Quick check for apid from outside is to do talosctl --endpoint IP --nodes IP version, this API should always respond as long as apid is still listening.
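As a complement to the talosctl version check above, a raw TCP probe can separate "nothing is listening" from "listening but not answering". A minimal sketch (assuming apid's default port 50000; this only tests the TCP connect, not an actual gRPC call, which matches the "hangs mid-request" symptom in this thread):

```python
import socket


def apid_listening(host: str, port: int = 50000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to apid's port can be opened.

    50000 is the default Talos apid port. A successful connect only
    proves a listener is up, not that gRPC requests will be answered.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True while talosctl still hangs, the listener is alive but requests are stalling somewhere behind it.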

@Filip7656

So you have tried all the stuff I did. :/ What version of VMware are you running? Have you tried installing from the ISO instead of the OVA? (Be sure to change the disk settings so the disks are seen by Talos.)

RvRuttenFS commented Mar 22, 2024

@smira
Ran the talosctl command and got back:

Client:
	Tag:         v1.6.2
	SHA:         26eee755
	Built:
	Go version:  go1.21.6 X:loopvar
	OS/Arch:     darwin/arm64
Server:

So no info/response there.

Looking at the log (talosctl logs apid --tail -1) shows only one older entry.

Any suggestions where or what kind of other logs may help in this?


smira commented Mar 22, 2024

I'm confused: how can you access the apid logs if you can't access the API?

Look at the console/serial logs. If there's nothing there, I'd assume it's not Talos.

Talos does a health check on apid, so if it stops responding, it should print to the console.

@RvRuttenFS

We asked ourselves the same question; that's why we said "partially" on purpose. Is there anything else we can check or confirm to make sense of this weird behavior?

After killing the apid process (through a privileged debug pod), a new apid process starts, and that does reset something, as we can then run talosctl again. Same effect as a reboot of the node.
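For anyone reproducing this, a privileged debug pod like the one mentioned can be sketched roughly as follows (the pod name, namespace, and image are placeholders; hostPID is what makes apid's PID visible and signalable from inside the pod):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-shell      # placeholder name
  namespace: kube-system # placeholder; must be allowed to run privileged pods
spec:
  hostNetwork: true
  hostPID: true          # share the host PID namespace to see apid
  containers:
    - name: shell
      image: alpine:3.19 # any small image with a shell works
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true
```

You can then kubectl exec into the pod and signal the apid process from the shared PID namespace (e.g. with busybox pkill).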

I also noticed 502 Bad Gateway errors sometimes when using talosctl.

Lastly, I now see I forgot to mention we use Omni SaaS - if that makes any difference.


smira commented Mar 22, 2024

If using Omni, you have console logs available in the machine view.

And it'd be better to create an issue in the Omni repo. The next release of Omni should have an omnictl support command to generate a great support bundle.

@RvRuttenFS

Seems we have found something. As I hijacked this issue for a bit, it seems fair to share what caused it for us.

In our cluster patch YAML file in Omni we had debug: true set. But instead of giving us more logging, it stopped showing apid's logs at all (this is actually a bug!). After some time, the logs that were never delivered appeared to fill up some buffer, and that froze apid, but only on VMware/ESXi. After we removed the debug key from the cluster patch, no more strange behavior was observed and the clusters worked again.
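The "fill up some buffer" theory matches a generic Unix failure mode: a process writing logs to a pipe or socket that nobody drains can only proceed until the kernel buffer is full, after which a blocking writer stalls. A minimal illustration of that mechanism (plain Python, not Talos code; the kernel pipe buffer is typically 64 KiB on Linux):

```python
import os

# Create a pipe and never read from it, simulating a log sink
# that stops consuming. Use a non-blocking writer so this script
# can observe the full buffer instead of hanging like apid would.
r, w = os.pipe()
os.set_blocking(w, False)

written = 0
try:
    while True:
        written += os.write(w, b"x" * 4096)
except BlockingIOError:
    # Buffer is full: a blocking log writer would now be stuck here,
    # and anything waiting on that writer stalls with it.
    pass

print("pipe buffer filled after", written, "bytes")
os.close(r)
os.close(w)
```

This is only an analogy for the observed symptom, not a confirmed diagnosis of the apid freeze.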

Not sure if @Filip7656 used this debug key too, but if you did, now you know it's something that should not be used.
If not, I hope you will figure out what is causing it for you.

Thanks everyone!


This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 23, 2024
@askedrelic

Anecdotal, but I believe I ran into something similar after setting debug: true, and it was confusing to debug until this issue suggested disabling debug. What clued me in was that kubectl would still work, but Talos gRPC calls would time out. As far as I remember, the dashboard didn't show anything specifically wrong.

If debug logs are crashing apid, that makes the issue hard to understand, especially when enabling debug logs was my first step as a Talos novice.

I'm testing on a Ubuntu 22.04 host, with libvirt/KVM/QEMU and a 4 core, 8GB RAM, 40GB disk guest running Talos 1.7.6.

@github-actions github-actions bot removed the Stale label Sep 25, 2024

smira commented Sep 25, 2024

Please never ever use debug: true unless you have no access to the machine itself over the network, and you're a developer.

[screenshot: documentation note about the debug option]

@askedrelic

@smira Thanks for expanding; I wasn't aware of that extra note.

This seems separate from the apid crash itself, but from the perspective of a new Talos user, that extra note is hard to find, and it could be worded differently if debug mode is discouraged. I've been working primarily with the default talosctl gen config, and enabling debug mode to increase logging was my first step toward understanding more. In that config, it's only documented as: debug: false # Enable verbose logging to the console.


smira commented Sep 26, 2024

Yeah, the proper move for us is to hide this knob completely. I'll look into that; it's certainly not something people should use.
