How to make queue-proxy wait for long requests to finish? #15649
Hi, please take a look at this issue. This is influenced by the revision timeout. |
Yes, per my tests the revision timeout changes the terminationGracePeriodSeconds. |
Here is an example that illustrates my scenario. I set the timeoutSeconds field on the Knative Service resource to 600 (10 min). For simplicity I created an application that sleeps for 50s and then finishes the request. Here is the application log for 2 subsequent requests:
A few seconds after the second request started being handled, I deleted the pod. The pod's 10-minute termination grace period allowed it to finish handling the request properly, but the HTTP client received an unexpected EOF error. Checking the queue-proxy log for that period, I see:
At 11:34:21 it received the TERM signal, then it says "Sleeping 30s to allow K8s propagation of non-ready state". I wish I could control that 30s sleep time, or even have the queue-proxy monitor the running requests and only terminate when they are all finished or the termination grace period has ended. |
Afaik that is the case. Requests are drained, and when a new request arrives while draining, the timer is restarted. I guess the problem described is with requests that were already being served before draining happens (when SIGTERM is received). This seems to be similar to #13953 (comment), but it was not reproducible back then. I will test this again, cc @dprotaso |
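For reference, the draining behavior described above can be sketched roughly as follows. This is a simplified illustration of the timer-reset mechanism, not the actual queue-proxy source; the type and field names are made up for the example:

```go
package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// drainer illustrates the mechanism: every incoming request resets a
// quiet-period timer, and Drain only returns once no new request has
// arrived for quietPeriod. Simplified sketch, not queue-proxy code.
type drainer struct {
	inner       http.Handler
	quietPeriod time.Duration

	mu    sync.Mutex
	timer *time.Timer
}

func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.mu.Lock()
	if d.timer != nil {
		d.timer.Reset(d.quietPeriod) // a new request restarts the drain timer
	}
	d.mu.Unlock()
	d.inner.ServeHTTP(w, r)
}

// Drain blocks until quietPeriod elapses with no new requests.
func (d *drainer) Drain() {
	done := make(chan struct{})
	d.mu.Lock()
	d.timer = time.AfterFunc(d.quietPeriod, func() { close(done) })
	d.mu.Unlock()
	<-done
}

func main() {
	d := &drainer{
		inner: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			time.Sleep(50 * time.Second) // simulate a slow request
			w.Write([]byte("done\n"))
		}),
		quietPeriod: 30 * time.Second,
	}
	srv := &http.Server{Addr: ":8080", Handler: d}
	go srv.ListenAndServe()

	// Wait for SIGTERM, then drain before exiting, as queue-proxy does.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop
	d.Drain()
	srv.Close()
}
```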
@lsergio hi, could you provide more details? I was not able to reproduce the following:
When I update the ksvc, the existing revision still works as expected and requests are served. |
@skonto the issue I had with updating revisions is that when the new revision is ready, the pod for the old one is terminated and then I get the same scenario as described in #15649 (comment) if there are on-going long duration requests. Did you manage to reproduce the scenario described in that comment? |
btw, I am using Knative Serving 1.15.2. If you don't face the issue with a newer release, I may try it as well. |
Hi @lsergio. I can reproduce that scenario; the connection between QP and the app is being killed. Your app may keep running until the termination grace period expires if it does not handle signals and runs as PID 1. Could you share the code and how you build the app container? Here is what I got:
In any case you need to handle the SIGTERM signal and shut down your application server gracefully (in Go, the server's graceful shutdown drains in-flight requests); note that QP does that:
Here is an example where I am using the autoscale-go sample app patched to handle the signals.
To summarize, you need to handle signals in the app container regardless of the programming language, and make sure that the app in the container actually receives the signal in the first place, e.g. use the exec form in the Dockerfile, an init process, etc. |
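As an illustration of the advice above, here is a minimal Go server that handles SIGTERM and drains in-flight requests before exiting. It is a sketch in the spirit of the patched autoscale-go sample mentioned in the thread, not the exact code used there; the port and timeout values are assumptions:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(50 * time.Second) // simulate a long-running request
		w.Write([]byte("done\n"))
	})
	srv := &http.Server{Addr: ":8080", Handler: mux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Block until the kubelet sends SIGTERM on pod deletion. This only
	// works if the process actually receives the signal (exec form in
	// the Dockerfile, or a signal-forwarding init process).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Shutdown stops accepting new connections and waits for in-flight
	// requests to finish; keep the deadline within the pod's
	// terminationGracePeriodSeconds.
	ctx, cancel := context.WithTimeout(context.Background(), 590*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown did not complete: %v", err)
	}
}
```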
Hi @skonto. My application is being built with Apache Camel K, using the following code:
With this config the created application container will handle the SIGTERM signal and wait for up to 600 seconds for the existing requests to finish before terminating the container.
And here is a timeline for my test:
|
I was not able to reproduce it with the autoscale sample app by following the steps above on the terminal; the second request always goes to the newly created pod. I am not sure how the Camel K extension works under the hood and whether it shuts down the Netty stack as expected, but I do see a warning in the logs at that point. Could you turn the debug level on at the app side, preferably for the HTTP protocol handling, probably via the Camel properties (I haven't tested it)? In addition, looking at your logs in #15649 (comment), the second request goes in before SIGTERM is received and QP enters draining mode. In that mode QP waits for all requests to finish (pre-stop is called):
At first glance there is no reason for the second request not to be handled by QP unless the app kills the connection. |
Hi @skonto. I'll try it with a non-Camel K app. But anyway, I see you mentioned:
But the problem is not actually related to requests that are started after the new pod is ready. The problem happens with requests that are being processed by a pod that is terminated. The QP waits for 30s and shuts down, even if the ongoing request has not yet finished. Per my tests with the Camel K app, it looks like the application receives the SIGTERM signal only after the QP is fully terminated. So the application is still operating while the QP is handling the signal. |
Here are some logs from a run where no request is being processed. QP log:
At 11:28:40 the queue-proxy receives the TERM signal, sleeps for 30s, and finishes shutting down at 11:29:10. And this is the application log:
Shutdown starts at 11:29:10, right after the queue-proxy fully terminates. If I repeat the tests and submit some long requests, I see the same behavior, where the application only gets the termination signal after the QP is terminated. And the existing requests on the terminating pod that do not finish within the 30s timeframe will fail, even though the termination grace period is way longer. |
Hi @lsergio,
I know; I am just reporting that the steps are not enough to reproduce it. It probably requires a test via code to actually hit the terminated pod. However, even if you hit the terminated pod during draining, the timer will be reset and the request should be processed. QP will not terminate if there are pending requests and connections are not broken. In any case, if the Camel K application is shutting down the HTTP server early, without a graceful shutdown of the connections (it seems so, and there is an exception that needs debugging), as in the issue I pointed to, then you will have the behavior you observe. Please discuss this first at the issue I pointed to on the Camel project side, quarkusio/quarkus#18890 (comment). It seems there is a known issue there. |
Hi @skonto. I created a golang app to test this scenario without Camel K, and indeed the queue-proxy is waiting for the requests to finish. I'll now figure out what's wrong with the Camel application. |
Ask your question here:
Let's say I have a Knative Service for an application that is slow and takes around 50s to handle each request. That application is running and requests are currently being handled.
Then someone updates the Knative Service in the middle of request processing, and this causes the existing pod to be terminated. I have configured the terminationGracePeriodSeconds so that the application has enough time to complete the existing long requests. But after 30s the queue-proxy container is terminated and the requests are closed on the client side with an unexpected EOF, although on the application side they are able to finish successfully. Is there some way to extend that 30s queue-proxy timeout, or configure it to terminate only after the application container has finished?
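For reference, a minimal sketch of the kind of Service manifest this question describes. The name and image are hypothetical, and per the comments earlier in the thread it is the revision-level timeoutSeconds that influences the pod's termination grace period:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: slow-app                             # hypothetical name
spec:
  template:
    spec:
      # Revision-level request timeout (10 min). Per the discussion above,
      # this value also influences the pod's terminationGracePeriodSeconds.
      timeoutSeconds: 600
      containers:
        - image: example.com/slow-app:latest  # hypothetical image
```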