Skip to content
This repository has been archived by the owner on Sep 25, 2024. It is now read-only.

velero-notifications intermittently fails #2

Open
jkanczler opened this issue Apr 27, 2023 · 33 comments
Open

velero-notifications intermittently fails #2

jkanczler opened this issue Apr 27, 2023 · 33 comments

Comments

@jkanczler
Copy link

jkanczler commented Apr 27, 2023

From time to time, the velero-notifications pod fails and after it just hangs, it stops producing logs.

The last log this time:

INFO -- : Reconnecting to API...
/usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in `block in readline': EOFError (EOFError) (Excon::Error::Socket)
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `loop'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `readline'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:73:in `block in parse'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in `loop'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in `parse'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in `response_call'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in `response_call'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in `response'
        from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in `request'
        from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in `request'
        from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:210:in `meta_list'
        from /app/lib/controller.rb:64:in `last_resource_version'
        from /app/lib/controller.rb:92:in `rescue in watch_backups'
        from /app/lib/controller.rb:74:in `watch_backups'
        from /app/lib/controller.rb:43:in `start'
        from app.rb:8:in `<main>'
@vitobotta
Copy link
Owner

Hi, does this happen with the latest version from 4 days ago as well?

@csimon82
Copy link

@vitobotta Yes the service would randomly hang and stop sending notifications for backups on v1654456872 and earlier this week we upgraded to the latest version v1682280591 and still have same issue.

@vitobotta
Copy link
Owner

can you share the content of the configmap?

@csimon82
Copy link

apiVersion: v1
kind: ConfigMap
metadata:
name: backups-last-resource-version
data:
resource-version: "0"

@vitobotta
Copy link
Owner

Weird, I can't reproduce. Can you uninstall and reinstall and see if it still happens?

@csimon82
Copy link

Just uninstalled and reinstalled and still getting the same logs on the previous failed instance. It will continue to send notifications for a while even though it is frequently restarting and then it will eventually hang and stop sending any notifications.

PS C:\Git\SharedServices.Infrastructure\src\Infrastructure\Aks-Canoes\deployment\helm> kubectl get po -n velero NAME READY STATUS RESTARTS AGE velero-674cdc5fbb-mjstt 1/1 Running 0 22d velero-backup-notification-bb9dd5d69-9n9vm 1/1 Running 1 (7m56s ago) 13m PS C:\Git\SharedServices.Infrastructure\src\Infrastructure\Aks-Canoes\deployment\helm> kubectl logs velero-backup-notification-bb9dd5d69-9n9vm -n velero --previous I, [2023-04-27T20:33:04.496708 #1] INFO -- K8s::Transport<https://172.32.0.1:443>: GET /api/v1 => HTTP 200: <K8s::API::MetaV1::APIResourceList> in 0.026s I, [2023-04-27T20:33:04.502292 #1] INFO -- K8s::Transport<https://172.32.0.1:443>: GET /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: <K8s::Resource> in 0.005s I, [2023-04-27T20:33:04.509669 #1] INFO -- K8s::Transport<https://172.32.0.1:443>: GET /apis/velero.io/v1 => HTTP 200: <K8s::API::MetaV1::APIResourceList> in 0.006s I, [2023-04-27T20:33:04.544956 #1] INFO -- K8s::Transport<https://172.32.0.1:443>: GET /apis/velero.io/v1/namespaces/velero/backups => HTTP 200: <K8s::API::MetaV1::List> in 0.032s I, [2023-04-27T20:33:04.545063 #1] INFO -- : Resource version for backups: 622511674 I, [2023-04-27T20:33:04.561275 #1] INFO -- K8s::Transport<https://172.32.0.1:443>: PATCH /api/v1/namespaces/velero/configmaps/backups-last-resource-version <Hash> => HTTP 200: <K8s::Resource> in 0.016s I, [2023-04-27T20:33:04.561401 #1] INFO -- : Watching backups (current resource version 622511674)... I, [2023-04-27T20:38:06.044028 #1] INFO -- : Connection to API lost: Connection reset by peer (Errno::ECONNRESET) I, [2023-04-27T20:38:06.044108 #1] INFO -- : Reconnecting to API... /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in block in readline': EOFError (EOFError) (Excon::Error::Socket)
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in loop' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in readline'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:73:in block in parse' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in loop'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in parse' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in response_call'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in response_call' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in response'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in request' from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in request'
from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:210:in meta_list' from /app/lib/controller.rb:64:in last_resource_version'
from /app/lib/controller.rb:92:in rescue in watch_backups' from /app/lib/controller.rb:74:in watch_backups'
from /app/lib/controller.rb:43:in start' from app.rb:8:in

'
/usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in block in readline': EOFError (EOFError) from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in loop'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in readline' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:73:in block in parse'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in loop' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in parse'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in response_call' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in response_call'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in response' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in request'
from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in request' from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:210:in meta_list'
from /app/lib/controller.rb:64:in last_resource_version' from /app/lib/controller.rb:92:in rescue in watch_backups'
from /app/lib/controller.rb:74:in watch_backups' from /app/lib/controller.rb:43:in start'
from app.rb:8:in <main>' /usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in sysread_nonblock': Connection reset by peer (Errno::ECONNRESET) (Excon::Error::Socket)
from /usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in read_nonblock' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:199:in read_nonblock'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in block in readline' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in loop'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in readline' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:131:in parse'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in response_call' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in response_call'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in response' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in request'
from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in request' from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:235:in watch'
from /app/lib/controller.rb:79:in watch_backups' from /app/lib/controller.rb:43:in start'
from app.rb:8:in <main>' /usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in sysread_nonblock': Connection reset by peer (Errno::ECONNRESET)
from /usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in read_nonblock' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:199:in read_nonblock'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in block in readline' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in loop'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in readline' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:131:in parse'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in response_call' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in response_call'
from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in response' from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in request'
from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in request' from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:235:in watch'
from /app/lib/controller.rb:79:in watch_backups' from /app/lib/controller.rb:43:in start'
from app.rb:8:in <main>' PS C:\Git\SharedServices.Infrastructure\src\Infrastructure\Aks-Canoes\deployment\helm>

@vitobotta
Copy link
Owner

The error seems to suggest some possible disconnections from the Kubernetes API for some reason. Does the configmap still show 0 as the latest backup resource version?

@csimon82
Copy link

My configmap.yaml is setup as follows:

apiVersion: v1
kind: ConfigMap
metadata:
name: backups-last-resource-version
data:
resource-version: "0"

When I describe it it appears to be a different version:

PS C:\Git\SharedServices.Infrastructure\src\Infrastructure\Aks-Canoes\deployment\helm> kubectl describe configmap backups-last-resource-version -n velero
Name: backups-last-resource-version
Namespace: velero
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: velero-notifications
meta.helm.sh/release-namespace: velero

Data

resource-version:

622578550

BinaryData

Events:

@csimon82
Copy link

Sorry for the weird formatting not sure why it did that:
image

@vitobotta
Copy link
Owner

What kind of cluster is it? Which version and which CNI do you use? Seems like some kind of network error

@vitobotta
Copy link
Owner

Do you have Prometheus/Grafana installed? Can you check the API server uptime?

@vitobotta
Copy link
Owner

@csimon82 Pull the latest changes and reinstall the chart. I have added an init container that sets some keepalive settings. These should help if there is some network instability for some reason. Let me know how it goes.

@csimon82
Copy link

@vitobotta So far looking better with the latest changes...will continue to monitor. BTW this is an Azure Kubernetes cluster version 1.24.6.

@vitobotta
Copy link
Owner

Glad to hear 👍

@csimon82
Copy link

csimon82 commented Apr 28, 2023

@vitobotta Unfortunately yesterday after I signed off it appeared to stop listening as initially observed. I had 2 different backup schedules setup, one every 5 minutes (which I removed before I left) and one every 60 minutes (aksbackup60minute). You can see in the attached logs it picked up the last hourly backup at "2023-04-27T22:32:38.310398" and then logs just stop at "2023-04-27T22:37:32.124776" and I didn't get another notification after that and would have expected one every hour.
image
veleroNotificationOutput.txt

@vitobotta
Copy link
Owner

I haven't come across this issue myself and from the errors you pasted before it would seem some kind of network issue causing disconnections. I was planning to rewrite the controller in Crystal so maybe I'll do it now, maybe there is some issue with the Ruby k8s client that for some reason has surfaced for you.

@vitobotta
Copy link
Owner

@csimon82 @jkanczler I rewrote the controller in Crystal and published a new version. Please try uninstalling and reinstalling the chart and let me know if you still have the same issue. I was unable to reproduce it but maybe there was some problem with the Ruby libraries I was using, so perhaps the Crystal version might do better. Let's see

@vitobotta
Copy link
Owner

I reverted to the Ruby version... was having problems with some dependencies with Crystal. Do you have Prometheus/Grafana or some other monitoring that can tell you the % of failures with the API server?

@vitobotta
Copy link
Owner

I found several reports of problems with the watch API and AKS, with some people suggesting to lower the timeout to less than 4 minutes. I have done that in the latest update, so please try reinstalling the chart and let me know how it goes.

@csimon82
Copy link

csimon82 commented May 1, 2023

@vitobotta I've updated to the latest version and as before it does work for a while and then appears to stop catching when the backups are done and sending a notification. It is interesting though now it doesn't appear to just stop logging and I'm getting a bunch of these connection lost messages...

I, [2023-05-01T20:46:55.673014 #1] INFO -- : Connection to API lost: EOFError (EOFError)
I, [2023-05-01T20:46:55.673095 #1] INFO -- : Reconnecting to API...
I, [2023-05-01T20:46:55.860282 #1] INFO -- : Resource version for backups: 631217053
I, [2023-05-01T20:46:55.860201 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /apis/velero.io/v1/namespaces/velero/backups => HTTP 200: K8s::API::MetaV1::List in 0.172s
I, [2023-05-01T20:46:55.876809 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: PATCH /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.016s
I, [2023-05-01T20:46:55.883090 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.006s
I, [2023-05-01T20:46:55.883346 #1] INFO -- : Watching backups (current resource version 631217053)...
I, [2023-05-01T20:49:55.772924 #1] INFO -- : Connection to API lost: EOFError (EOFError)
I, [2023-05-01T20:49:55.773167 #1] INFO -- : Reconnecting to API...
I, [2023-05-01T20:49:55.901905 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /apis/velero.io/v1/namespaces/velero/backups => HTTP 200: K8s::API::MetaV1::List in 0.115s
I, [2023-05-01T20:49:55.901989 #1] INFO -- : Resource version for backups: 631221689
I, [2023-05-01T20:49:55.914639 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: PATCH /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.012s
I, [2023-05-01T20:49:55.919181 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.004s
I, [2023-05-01T20:49:55.919350 #1] INFO -- : Watching backups (current resource version 631221689)...
I, [2023-05-01T20:52:55.872918 #1] INFO -- : Connection to API lost: EOFError (EOFError)
I, [2023-05-01T20:52:55.873005 #1] INFO -- : Reconnecting to API...
I, [2023-05-01T20:52:56.010801 #1] INFO -- : Resource version for backups: 631226063
I, [2023-05-01T20:52:56.010714 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /apis/velero.io/v1/namespaces/velero/backups => HTTP 200: K8s::API::MetaV1::List in 0.121s
I, [2023-05-01T20:52:56.028053 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: PATCH /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.017s
I, [2023-05-01T20:52:56.033926 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.006s
I, [2023-05-01T20:52:56.034141 #1] INFO -- : Watching backups (current resource version 631226063)...
I, [2023-05-01T20:55:55.972648 #1] INFO -- : Connection to API lost: EOFError (EOFError)
I, [2023-05-01T20:55:55.972722 #1] INFO -- : Reconnecting to API...
I, [2023-05-01T20:55:56.540660 #1] INFO -- : Resource version for backups: 631230547
I, [2023-05-01T20:55:56.540525 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /apis/velero.io/v1/namespaces/velero/backups => HTTP 200: K8s::API::MetaV1::List in 0.550s
I, [2023-05-01T20:55:56.572368 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: PATCH /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.031s
I, [2023-05-01T20:55:56.577617 #1] INFO -- K8s::Transporthttps://172.32.0.1:443: GET /api/v1/namespaces/velero/configmaps/backups-last-resource-version => HTTP 200: K8s::Resource in 0.005s
I, [2023-05-01T20:55:56.577800 #1] INFO -- : Watching backups (current resource version 631230547)...

@vitobotta
Copy link
Owner

Hi @csimon82 - in the meantime I have worked with the developer of the Kubernetes client library for Crystal and he got the autoresume with timeout working nicely. I will switch back to the Crystal version and release a new image tomorrow. Will update this thread when done, so you can try that version. I still haven't been able to reproduce your issue but I have read several reports of problems with streaming with the Kubernetes API in AKS. Let's see if the improvements in the Crystal version help your case.

@vitobotta
Copy link
Owner

@csimon82 I have published the new image with the Crystal version. Please uninstall and reinstall and let me know how it goes. This version automatically reconnects from where it left if there is a disconnection for some reason. It has worked fine for me for one full day with no missed notifications.

@csimon82
Copy link

csimon82 commented May 2, 2023

@vitobotta I'm assuming you are using the same version v1.0.0 for the Crystal version now right? I pulled that image into our ACR and have deployed it and have a schedule running every 5 minutes now but the notification service doesn't appear to pick up anything now and I don't see any error logging.
image
Any thoughts on this?

@vitobotta
Copy link
Owner

@csimon82 Yes I am using that version and works fine for me and I have never been able to reproduce your issues with my clusters. Did you redeploy the chart? I changed the update strategy to "Recreate" in purpose (for now) to ensure that the image is updated and the container recreated since I have overwritten the v1.0.0 image.

If you perform a backup manually, does it work? I am wondering if it's something about schedules.

@vitobotta
Copy link
Owner

I have installed the chart in 2 other clusters, these ones in GKE to see how it goes. But with my 2 clusters in Hetzner Cloud (using k3s) it has been working great without any issue.

@csimon82
Copy link

csimon82 commented May 2, 2023

@vitobotta Yes I redeployed the chart and the strategy is set to "Recreate".
image

Also manual backup did not get picked up by the notification service...the service just seemed to hang and stop logging over an hour ago.
image

@vitobotta
Copy link
Owner

I don't know what to say. I don't have any problems with it and I also tested it successfully with the two GKE clusters (manual backups for now). If you exec into a pod or watch some resources with kubectl does it work? Are you having these problems only with one cluster? Do you have a chance to try with another cluster?

@vitobotta
Copy link
Owner

@csimon82 hey, is Velero installed in the velero namespace?

@vitobotta
Copy link
Owner

nvm I saw it from the pic

@csimon82
Copy link

csimon82 commented May 3, 2023

@vitobotta I've been doing some more testing this morning and am seeing some exceptions.
image
Also worth noting I see some info messages about ensuring the environment variables are all set...currently I have the EMAIL_SMTP_USERNAME and EMAIL_SMTP_PASSWORD values commented out since those aren't setup on our SMTP server (this worked fine before so don't think it's an issue but thought I'd mention it.)
image

@vitobotta
Copy link
Owner

Can you please set email.enabled to false if you are not using it? There might be a problem with exception handling. Also disable any medium you are not using, reinstall and try a manual backup. Does the tool pick up those events for the manual backup?

@csimon82
Copy link

csimon82 commented May 3, 2023

@vitobotta Email is the only one we are using, it is configured like the following.
image

Also seeing another exception:
image

Also no I'm not seeing any logs indicating that the manual backups are picked up.

@yerbaszen
Copy link

I'm having the same issue as @csimon82 for a while now.

image

Same error for 2 clusters (running in digital ocean)
I've uninstalled / reinstalled the chart, same outcome.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants