velero-notifications intermittently fails #2
Hi, does this happen with the latest version from 4 days ago as well?
@vitobotta Yes, the service would randomly hang and stop sending notifications for backups on v1654456872. Earlier this week we upgraded to the latest version, v1682280591, and still have the same issue.
Can you share the content of the configmap?
```yaml
apiVersion: v1
```
Weird, I can't reproduce. Can you uninstall and reinstall and see if it still happens?
Just uninstalled and reinstalled, and I'm still getting the same logs as on the previous failed instance. It will continue to send notifications for a while, even though it is frequently restarting, and then it will eventually hang and stop sending any notifications. These are the errors from the logs:
```
/usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in `block in readline': EOFError (EOFError)
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `loop'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `readline'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:73:in `block in parse'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in `loop'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:72:in `parse'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in `response_call'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in `response_call'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in `response'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in `request'
	from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in `request'
	from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:210:in `meta_list'
	from /app/lib/controller.rb:64:in `last_resource_version'
	from /app/lib/controller.rb:92:in `rescue in watch_backups'
	from /app/lib/controller.rb:74:in `watch_backups'
	from /app/lib/controller.rb:43:in `start'
	from app.rb:8:in `<main>'

/usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in `sysread_nonblock': Connection reset by peer (Errno::ECONNRESET) (Excon::Error::Socket)
	from /usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in `read_nonblock'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:199:in `read_nonblock'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in `block in readline'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `loop'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `readline'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:131:in `parse'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in `response_call'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in `response_call'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in `response'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in `request'
	from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in `request'
	from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:235:in `watch'
	from /app/lib/controller.rb:79:in `watch_backups'
	from /app/lib/controller.rb:43:in `start'
	from app.rb:8:in `<main>'

/usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in `sysread_nonblock': Connection reset by peer (Errno::ECONNRESET)
	from /usr/local/lib/ruby/3.1.0/openssl/buffering.rb:214:in `read_nonblock'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:199:in `read_nonblock'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:79:in `block in readline'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `loop'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/socket.rb:70:in `readline'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/response.rb:131:in `parse'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/response_parser.rb:7:in `response_call'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/middlewares/redirect_follower.rb:82:in `response_call'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:459:in `response'
	from /usr/local/bundle/gems/excon-0.99.0/lib/excon/connection.rb:290:in `request'
	from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/transport.rb:294:in `request'
	from /usr/local/bundle/gems/k8s-ruby-0.14.0/lib/k8s/resource_client.rb:235:in `watch'
	from /app/lib/controller.rb:79:in `watch_backups'
	from /app/lib/controller.rb:43:in `start'
	from app.rb:8:in `<main>'
```
The error seems to suggest intermittent disconnections from the Kubernetes API for some reason. Does the configmap still show 0 as the latest backup resource version?
My configmap.yaml is set up as follows:

```yaml
apiVersion: v1
```

When I describe it, it appears to show a different version:

```
PS C:\Git\SharedServices.Infrastructure\src\Infrastructure\Aks-Canoes\deployment\helm> kubectl describe configmap backups-last-resource-version -n velero
Data
====
resource-version:
----
622578550

BinaryData
====

Events:
```
What kind of cluster is it? Which version, and which CNI do you use? It seems like some kind of network error.
Do you have Prometheus/Grafana installed? Can you check the API server uptime?
@csimon82 Pull the latest changes and reinstall the chart. I have added an init container that sets some keepalive settings; these should help if there is some network instability for some reason. Let me know how it goes.
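For context, a keepalive init container typically tightens the kernel's TCP keepalive settings so that half-open connections to the API server are detected and dropped quickly. A minimal sketch of the pattern follows; the image and the exact sysctl values are assumptions for illustration, not necessarily what the chart ships:

```yaml
# Hypothetical sketch of a keepalive init container (values illustrative).
initContainers:
  - name: set-tcp-keepalive
    image: busybox:1.36
    securityContext:
      privileged: true   # needed to write net.* sysctls in the pod's network namespace
    command:
      - sh
      - -c
      - |
        sysctl -w net.ipv4.tcp_keepalive_time=60
        sysctl -w net.ipv4.tcp_keepalive_intvl=10
        sysctl -w net.ipv4.tcp_keepalive_probes=6
```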
@vitobotta So far it's looking better with the latest changes... will continue to monitor. BTW, this is an Azure Kubernetes (AKS) cluster, version 1.24.6.
Glad to hear 👍
@vitobotta Unfortunately, yesterday after I signed off it appeared to stop listening, as initially observed. I had 2 different backup schedules set up: one every 5 minutes (which I removed before I left) and one every 60 minutes (aksbackup60minute). You can see in the attached logs that it picked up the last hourly backup at "2023-04-27T22:32:38.310398", and then the logs just stop at "2023-04-27T22:37:32.124776". I didn't get another notification after that, and I would have expected one every hour.
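For reference, the hourly schedule described above would be a Velero Schedule resource along these lines; only the name aksbackup60minute comes from this thread, while the cron expression and template fields are assumed for illustration:

```yaml
# Hypothetical reconstruction of the hourly schedule mentioned above.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: aksbackup60minute
  namespace: velero
spec:
  schedule: "0 * * * *"      # assumed: at the top of every hour
  template:
    includedNamespaces:
      - "*"                  # assumed scope
    ttl: 72h0m0s             # assumed retention
```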
I haven't come across this issue myself, and from the errors you pasted before it seems like some kind of network issue is causing disconnections. I was planning to rewrite the controller in Crystal, so maybe I'll do it now; perhaps there is some issue with the Ruby k8s client that has surfaced for you for some reason.
@csimon82 @jkanczler I rewrote the controller in Crystal and published a new version. Please try uninstalling and reinstalling the chart and let me know if you still have the same issue. I was unable to reproduce it, but maybe there was some problem with the Ruby libraries I was using, so perhaps the Crystal version will do better. Let's see.
I reverted to the Ruby version... I was having problems with some dependencies in Crystal. Do you have Prometheus/Grafana or some other monitoring that can tell you the percentage of failures with the API server?
I found several reports of problems with the watch API on AKS, with some people suggesting lowering the timeout to less than 4 minutes, likely because Azure's load balancer drops idle connections after 4 minutes by default. I have done that in the latest update, so please try reinstalling the chart and let me know how it goes.
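Concretely, the watch timeout is the timeoutSeconds parameter on the watch request against the Kubernetes API; keeping it under the 4-minute mark means the client closes and reopens the stream itself before the idle timeout can cut it. A watch request with an assumed value of 230 seconds would look like:

```
GET /apis/velero.io/v1/namespaces/velero/backups?watch=1&timeoutSeconds=230
```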
@vitobotta I've updated to the latest version, and as before it works for a while and then appears to stop catching when backups are done and sending notifications. Interestingly, though, now it doesn't just stop logging; I'm getting a bunch of these connection-lost messages:

```
I, [2023-05-01T20:46:55.673014 #1] INFO -- : Connection to API lost: EOFError (EOFError)
```
Hi @csimon82 - in the meantime I have worked with the developer of the Kubernetes client library for Crystal, and he got auto-resume with a timeout working nicely. I will switch back to the Crystal version and release a new image tomorrow, and will update this thread when done so you can try that version. I still haven't been able to reproduce your issue, but I have read several reports of problems with streaming from the Kubernetes API on AKS. Let's see if the improvements in the Crystal version help your case.
@csimon82 I have published the new image with the Crystal version. Please uninstall and reinstall and let me know how it goes. This version automatically reconnects from where it left off if there is a disconnection for some reason. It has worked fine for me for one full day with no missed notifications.
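Reconnecting "from where it left off" works by replaying the watch from the last resourceVersion the controller saw, which it persists between requests. Based on the kubectl describe output earlier in this thread, the ConfigMap holding that state looks roughly like this (the exact manifest layout is an assumption):

```yaml
# Sketch of the state ConfigMap, reconstructed from the describe output above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: backups-last-resource-version
  namespace: velero
data:
  resource-version: "622578550"
```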
@vitobotta I'm assuming you are using the same version, v1.0.0, for the Crystal version now, right? I pulled that image into our ACR and deployed it, and I have a schedule running every 5 minutes now, but the notification service doesn't appear to pick up anything, and I don't see any error logging.
@csimon82 Yes, I am using that version and it works fine for me; I have never been able to reproduce your issues with my clusters. Did you redeploy the chart? I changed the update strategy to "Recreate" on purpose (for now) to ensure that the image is updated and the container recreated, since I have overwritten the v1.0.0 image. If you perform a backup manually, does it work? I am wondering if it's something about schedules.
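For reference, "Recreate" is a standard Deployment update strategy: the old pod is stopped before the new one is created, so a re-applied chart actually replaces the running container instead of rolling alongside it:

```yaml
# Deployment excerpt: with Recreate, the old pod is terminated first.
spec:
  strategy:
    type: Recreate
```

Note that re-pulling an overwritten tag like v1.0.0 also depends on the container's imagePullPolicy; Always would be needed to guarantee a fresh pull, though whether the chart sets that is an assumption here.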
I have installed the chart in 2 other clusters, this time in GKE, to see how it goes. But with my 2 clusters in Hetzner Cloud (using k3s) it has been working great, without any issues.
@vitobotta Yes, I redeployed the chart and the strategy is set to "Recreate". Also, a manual backup did not get picked up by the notification service; the service just seemed to hang and stopped logging over an hour ago.
I don't know what to say. I don't have any problems with it, and I also tested it successfully with the two GKE clusters (manual backups for now). If you exec into a pod or watch some resources with kubectl, does that work? Are you having these problems with only one cluster? Do you have a chance to try with another cluster?
@csimon82 Hey, is Velero installed in the velero namespace?
Never mind, I saw it in the pic.
@vitobotta I've been doing some more testing this morning and am seeing some exceptions.
Can you please set
@vitobotta Email is the only one we are using; it is configured like the following. I'm also seeing another exception, and no, I'm not seeing any logs indicating that the manual backups are picked up.
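The email settings are supplied through the chart's Helm values; the keys below are purely hypothetical placeholders to illustrate the shape of an SMTP notification config, and the real chart's schema may differ:

```yaml
# Purely hypothetical values sketch; key names are illustrative,
# not the chart's actual schema.
notifications:
  email:
    enabled: true
    smtp:
      host: smtp.example.com
      port: 587
      username: velero-notify
    fromAddress: velero@example.com
    toAddress: ops@example.com
```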
I'm having the same issue as @csimon82 for a while now. Same error for 2 clusters (running in DigitalOcean).
From time to time, the velero-notifications pod fails, and afterwards it just hangs and stops producing logs.
The last log this time: