-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bad gateway failures of gRPC #156
Comments
@cvetkovic, can you please take a look and tell if you have some input on these error messages? |
These errors were caused by knative terminating pods that were handling requests. This was in the queueproxy container logs of one of the pods:
Four other pods were terminated at the same time (scaling down from initial burst of pod deployments). And five gRPC failures were registered at 16:34:02.910 UTC. Such errors might surface when we have bursty invocation pattern: scaling down after burst ended might terminate pod handling the request. The question is, how autoscaler should detect working pods and not terminate them before they finish? @anshalshukla, do you have something on that matter from theoretical standpoint? |
What do you use for sandboxing? containerd or Firecracker? |
Containerd. Configuration is running |
I have personally never had any issues like what you posted with containerd, and I set up a cluster last week that still works without any issues. With Firecracker, I did have a lot of gRPC timeout as the uVMs got killed for some reason I never managed to explain. Probably a buggy implementation of the kubelet <-> Firecracker-containerd interface. Another thing that I think about maybe your cluster is too small to make everything fit. I see that the PR that updates Knative to v1.9 plays with tainting and that may have implications. As per your question about how running pods are detected, I think the activator takes care of that, i.e. it has the list of all sandbox endpoints that are provided by the Kubernetes endpoint controller. The autoscaler just issues a decision to upscale or downscale to some number of instances. |
I don't think that this has something to do with the size of the cluster. For attached trace, at most it used 25 pods with 83m CPU limit, so ~2 cores total. |
Version & Branch
main
branch, default configuration inscripts/setup/setup.cfg
Expected Behaviour
No failures in calls
Actual Behaviour
Sometimes, ~10% of invocations fails with error
gRPC timeout exceeded for function trace-func-0-18129329275763004641 - rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "text/plain; charset=utf-8"
Steps to Reproduce the Problem
durations.csv
invocations.csv
memory.csv
Additional Info
Error observed on number of clusters and traces.
Seems that this issue is connected to #129. Before failing with
"Invalid argument provided..."
there are messages telling that there are many failures, that happens to be due to this error.Sometimes, error changes to
504 (Gateway Timeout)
orunexpected EOF
.Most of the failures happen in the first minutes, but it continues to fail throughout the experiment. Logs from the loader is attached as well.
loader_trace.txt
The text was updated successfully, but these errors were encountered: