How to make queue-proxy wait for long requests to finish? #15649

Closed
lsergio opened this issue Dec 12, 2024 · 14 comments
Labels: kind/question (Further information is requested)

Comments

lsergio commented Dec 12, 2024

Ask your question here:

Let's say I have a KnativeService for an application that is slow and will take around 50s to handle each request. That application is running and requests are currently being handled.

Then someone updates the KnativeService in the middle of request processing, and this causes the existing pod to be terminated. I have configured the terminationGracePeriodSeconds so that the application has enough time to complete the existing long requests.

But after 30s the queue-proxy container is terminated and the requests are closed on the client side with an unexpected EOF, although on the application side they are able to finish successfully.

Is there some way to extend that 30s queue-proxy timeout or configure it to terminate only after the application container has finished?

lsergio added the kind/question label on Dec 12, 2024
skonto (Contributor) commented Dec 13, 2024

Hi, please take a look at this issue. This behavior is influenced by the revision timeout.

lsergio (Author) commented Dec 13, 2024

Yes, per my tests the revision timeout changes the terminationGracePeriodSeconds for the service pod, but as soon as the queue-proxy gets the SIGTERM signal, it waits for at most 30s before shutting down and closing the long-running connections.
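
For reference, one way to confirm that mapping is to read the grace period straight off the generated pod (the pod name here is just the one that appears in the logs later in this thread):

$ kubectl get pod rest-00001-deployment-798c999c79-49t9g \
    -o jsonpath='{.spec.terminationGracePeriodSeconds}'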

lsergio (Author) commented Dec 16, 2024

Here is an example that illustrates my scenario. I set the timeoutSeconds field on the Knative Service resource to 600 (10 min).
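
For reference, a minimal sketch of such a Service manifest (the name and image are illustrative, not the exact resource used here):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: rest
spec:
  template:
    spec:
      timeoutSeconds: 600        # revision timeout; per the tests above it also drives terminationGracePeriodSeconds
      containers:
      - image: example.registry/slow-app:latest   # illustrative image reference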

For simplicity I created an application that sleeps for 50s and finishes the requests.

Here is the application log for two consecutive requests:

2024-12-16 11:32:43,751 INFO  [route1] (executor-thread-1) Sleeping for 50 seconds
2024-12-16 11:33:33,755 INFO  [route1] (executor-thread-1) Finished!!!
2024-12-16 11:34:16,721 INFO  [route1] (executor-thread-1) Sleeping for 50 seconds
2024-12-16 11:35:06,722 INFO  [route1] (executor-thread-1) Finished!!!

A few seconds after the second request started being handled, I deleted the pod. The pod's 10-minute termination grace period allowed it to finish handling the request properly, but the HTTP client received an unexpected EOF error.

Checking the queue-proxy log for that period, I see:

{"severity":"INFO","timestamp":"2024-12-16T11:30:31.726619187Z","logger":"queueproxy","caller":"sharedmain/main.go:277","message":"Starting http server admin:8022","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}
{"severity":"INFO","timestamp":"2024-12-16T11:34:21.090616826Z","logger":"queueproxy","caller":"sharedmain/main.go:305","message":"Received TERM signal, attempting to gracefully shutdown servers.","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}
{"severity":"INFO","timestamp":"2024-12-16T11:34:21.090676717Z","logger":"queueproxy","caller":"sharedmain/main.go:306","message":"Sleeping 30s to allow K8s propagation of non-ready state","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}
{"severity":"INFO","timestamp":"2024-12-16T11:34:21.094294769Z","logger":"queueproxy","caller":"sharedmain/handlers.go:109","message":"Attached drain handler from user-container&{GET /wait-for-drain HTTP/1.1 1 1 map[Accept:[*/*] Accept-Encoding:[gzip] User-Agent:[kube-lifecycle/1.30+]] {} <nil> 0 [] false 10.10.78.191:8022 map[] map[] <nil> map[] 10.251.147.69:47034 /wait-for-drain <nil> <nil> <nil> 0xc0002242d0 0xc00007e6c0 [] map[]}","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}
{"severity":"INFO","timestamp":"2024-12-16T11:34:51.09157874Z","logger":"queueproxy","caller":"sharedmain/main.go:310","message":"Shutting down server: metrics","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}
{"severity":"INFO","timestamp":"2024-12-16T11:34:51.091848575Z","logger":"queueproxy","caller":"sharedmain/main.go:310","message":"Shutting down server: main","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}
{"severity":"INFO","timestamp":"2024-12-16T11:34:51.091926097Z","logger":"queueproxy","caller":"sharedmain/main.go:310","message":"Shutting down server: admin","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}
{"severity":"INFO","timestamp":"2024-12-16T11:34:51.096524587Z","logger":"queueproxy","caller":"sharedmain/main.go:322","message":"Shutdown complete, exiting...","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}

At 11:34:21 it received the TERM signal, then logged "Sleeping 30s to allow K8s propagation of non-ready state".
After those 30s passed, it shut down the servers and terminated the container, which caused the EOF, as the second request was still being handled.

I wish I could control that 30s sleep time, or even have the queue-proxy monitor the in-flight requests and terminate only when they have all finished or the termination grace period has ended.

skonto (Contributor) commented Dec 16, 2024

I wish I could control that 30s sleep time, or even have the queue-proxy monitor the in-flight requests and terminate only when they have all finished or the termination grace period has ended.

AFAIK that is already the case. Requests are drained, and when a new request arrives while draining, the timer is restarted. I guess the problem described is with requests that were already being served before draining starts (i.e. before SIGTERM is received). This seems similar to #13953 (comment), but it was not reproducible back then; I will test this again. cc @dprotaso
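
For intuition only, here is a minimal Go sketch of the drain pattern described above (an illustration, not the actual queue-proxy code): each incoming request restarts a quiet-period timer, and shutdown only proceeds once a full quiet period elapses with no new requests.

package main

import (
	"context"
	"net/http"
	"sync"
	"time"
)

// drainer is an illustrative stand-in for the behavior described above:
// every request resets a quiet-period timer, and Drain blocks until a
// full quiet period passes with no new requests.
type drainer struct {
	inner http.Handler
	quiet time.Duration

	mu    sync.Mutex
	timer *time.Timer
	done  chan struct{}
	once  sync.Once
}

func newDrainer(inner http.Handler, quiet time.Duration) *drainer {
	return &drainer{inner: inner, quiet: quiet, done: make(chan struct{})}
}

func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.resetTimer() // a request during draining restarts the countdown
	d.inner.ServeHTTP(w, r)
}

func (d *drainer) resetTimer() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.quiet, func() {
		d.once.Do(func() { close(d.done) })
	})
}

// Drain starts (or restarts) the countdown and waits for it to expire.
func (d *drainer) Drain() {
	d.resetTimer()
	<-d.done
}

func main() {
	h := newDrainer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}), 30*time.Second)
	srv := &http.Server{Addr: ":8080", Handler: h}
	go srv.ListenAndServe()

	// Here we simply drain and shut down once the quiet period passes
	// with no traffic; in the real queue-proxy this happens after SIGTERM.
	h.Drain()
	_ = srv.Shutdown(context.Background())
}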

skonto (Contributor) commented Dec 19, 2024

@lsergio hi, could you provide more details? I was not able to reproduce the following:

Then someone updates the KnativeService in the middle of request processing, and this causes the existing pod to be terminated.

When I update the ksvc, the existing revision keeps working as expected and requests are served.

lsergio (Author) commented Dec 20, 2024

@skonto the issue I had with updating revisions is that when the new revision is ready, the pod for the old one is terminated, and then I get the same scenario as described in #15649 (comment) if there are ongoing long-duration requests. Did you manage to reproduce the scenario described in that comment?

lsergio (Author) commented Dec 20, 2024

Btw, I am using Knative Serving 1.15.2. If you don't face the issue with a newer release, I may try it as well.

skonto (Contributor) commented Dec 24, 2024

Hi @lsergio. I can reproduce that scenario; the connection between QP and the app is being killed. Your app may keep running until the termination grace period expires if it does not handle signals and runs as PID 1. Could you share the code and how you build the app container? Here is what I got:

curl -H "Host: autoscale-go.default.example.com"  "http://192.168.39.63:32679?sleep=50000"
...
< HTTP/1.1 502 Bad Gateway
< content-length: 4
< content-type: text/plain; charset=utf-8
< date: Wed, 25 Dec 2024 22:14:09 GMT
< x-content-type-options: nosniff
< x-envoy-upstream-service-time: 48962
< server: envoy
< 
EOF

In any case you need to handle the SIGTERM signal and shut down your application server gracefully (Go's Shutdown drains in-flight requests); note that QP does that:

// Shutdown gracefully shuts down the server without interrupting any
// active connections. Shutdown works by first closing all open
// listeners, then closing all idle connections, and then waiting
// indefinitely for connections to return to idle and then shut down.
// If the provided context expires before the shutdown is complete,
// Shutdown returns the context's error, otherwise it returns any
// error returned from closing the [Server]'s underlying Listener(s).
//
// When Shutdown is called, [Serve], [ListenAndServe], and
// [ListenAndServeTLS] immediately return [ErrServerClosed]. Make sure the
// program doesn't exit and waits instead for Shutdown to return.
//
// Shutdown does not attempt to close nor wait for hijacked
// connections such as WebSockets. The caller of Shutdown should
// separately notify such long-lived connections of shutdown and wait
// for them to close, if desired. See [Server.RegisterOnShutdown] for a way to
// register shutdown notification functions.
//
// Once Shutdown has been called on a server, it may not be reused;
// future calls to methods such as Serve will return ErrServerClosed.
func (srv *Server) Shutdown(ctx context.Context) error {
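
For illustration, a minimal sketch of an app container doing this with plain net/http (the handler and sleep behavior are illustrative, not the autoscale-go sample itself): it handles SIGTERM and lets Shutdown drain in-flight requests, bounded by the termination grace period.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Simulate a slow request, similar to the 50s handler in this issue.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(50 * time.Second)
		w.Write([]byte("It worked!!"))
	})
	srv := &http.Server{Addr: ":8080", Handler: mux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// The kubelet sends SIGTERM when the pod is deleted; if the process does
	// not handle it (e.g. running as PID 1), in-flight connections can be
	// broken or the container just lingers until the grace period expires.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Drain in-flight requests; the deadline should stay within the
	// terminationGracePeriodSeconds (600s earlier in this thread).
	ctx, cancel := context.WithTimeout(context.Background(), 600*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}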

Here is an example where I am using the autoscale-go sample app patched to handle the signals.

$k get po
NAME                                             READY   STATUS    RESTARTS   AGE
autoscale-go-00001-deployment-7fd9f8d686-b226j   2/2     Running   0          10s

$k delete po --all
pod "autoscale-go-00001-deployment-7fd9f8d686-b226j" deleted

$k get po
NAME                                             READY   STATUS              RESTARTS   AGE
autoscale-go-00001-deployment-7fd9f8d686-b226j   2/2     Terminating         0          14s
autoscale-go-00001-deployment-7fd9f8d686-h98rm   0/2     ContainerCreating   0          2s

$ k get po
NAME                                             READY   STATUS        RESTARTS   AGE
autoscale-go-00001-deployment-7fd9f8d686-b226j   2/2     Terminating   0          40s
autoscale-go-00001-deployment-7fd9f8d686-h98rm   2/2     Running       0          28s

$k get po
NAME                                             READY   STATUS        RESTARTS   AGE
autoscale-go-00001-deployment-7fd9f8d686-b226j   1/2     Terminating   0          41s
autoscale-go-00001-deployment-7fd9f8d686-h98rm   2/2     Running       0          29s

# QP is draining so probing is failing on purpose (1/2 ready)

# On another terminal the request succeeds after a while; QP waits as expected
$ curl -H "Host: autoscale-go.default.example.com"  "http://192.168.39.63:32679?sleep=50000"
Slept for 50000.03 milliseconds.

# Tried with larger timeouts too (easier to demonstrate). Note that when your timeout is low you need to make sure you delete pods fast enough.
# To get the EOF it must hold that `time_since_requests_issued + 30s < sleep_time`,
# otherwise the request will succeed anyway.

$ curl -H "Host: autoscale-go.default.example.com"  "http://192.168.39.63:32679?sleep=90000"
Slept for 90000.53 milliseconds.

To summarize: regardless of the programming language, you need to handle signals in the app container and make sure that the app in the container actually receives the signal in the first place, e.g. use the exec form in the Dockerfile, an init process, etc.

lsergio (Author) commented Dec 26, 2024

Hi @skonto.

My application is built with Apache Camel K, using the following code:

apiVersion: camel.apache.org/v1
kind: Integration
metadata:
  name: rest
spec:  
  replicas: 1
  sources:
  - name: main.yaml
    content: |-
      - from:
          uri: rest:get:/demo
          steps:
          - log:
              message: "Sleeping for 50 seconds"
          - delay:
              asyncDelayed: false
              expression:
                constant: "50000"
          - setBody:
              expression:
                constant: "It worked!!"
          - log:
              message: "Finished!!!"
  traits:
    camel:
      runtimeVersion: 3.15.0
      properties:
      - camel.main.shutdownTimeout=600 # this is the time that the application will wait after it receives a SIGTERM signal for the executing requests to finish
    service:
      auto: false
      type: ClusterIP
    health:
      enabled: true
    knative-service:
      timeoutSeconds: 600             # this maps to knative's revision timeout

With this config the created application container will handle the SIGTERM signal and wait for up to 600 seconds for the existing requests to finish before terminating the container.
When I kill the running pod, I can see the draining process in the application log:

2024-12-26 11:23:34,503 INFO  [route1] (executor-thread-1) Sleeping for 50 seconds
2024-12-26 11:24:10,988 INFO  [route1] (executor-thread-2) Sleeping for 50 seconds
2024-12-26 11:24:11,583 INFO  [route1] (executor-thread-3) Sleeping for 50 seconds
2024-12-26 11:24:12,085 INFO  [route1] (executor-thread-4) Sleeping for 50 seconds
2024-12-26 11:24:12,386 INFO  [route1] (executor-thread-5) Sleeping for 50 seconds
2024-12-26 11:24:12,645 INFO  [route1] (executor-thread-6) Sleeping for 50 seconds
2024-12-26 11:24:12,964 INFO  [route1] (executor-thread-7) Sleeping for 50 seconds
2024-12-26 11:24:13,216 INFO  [route1] (executor-thread-8) Sleeping for 50 seconds
2024-12-26 11:24:24,508 INFO  [route1] (executor-thread-1) Finished!!!
2024-12-26 11:24:57,407 WARN  [io.net.boo.ServerBootstrap] (vert.x-acceptor-thread-0) Failed to register an accepted channel: [id: 0x6408cfdd, L:/127.0.0.1:8080 ! R:/127.0.0.1:39802]: java.lang.IllegalStateException
	at io.vertx.core.net.impl.VertxEventLoopGroup.next(VertxEventLoopGroup.java:37)
	at io.vertx.core.net.impl.VertxEventLoopGroup.register(VertxEventLoopGroup.java:53)
	at io.netty.bootstrap.ServerBootstrap$ServerBootstrapAcceptor.channelRead(ServerBootstrap.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918)
	at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:97)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:553)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)

2024-12-26 11:24:57,409 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (main) Apache Camel 4.8.0 (camel-1) is shutting down (timeout:10m0s)
2024-12-26 11:24:57,416 INFO  [org.apa.cam.imp.eng.DefaultShutdownStrategy] (Camel (camel-1) thread #4 - ShutdownTask) Waiting as there are still 7 inflight and pending exchanges to complete, timeout in 600 seconds. Inflights per route: [route1 = 7]
2024-12-26 11:24:58,416 INFO  [org.apa.cam.imp.eng.DefaultShutdownStrategy] (Camel (camel-1) thread #4 - ShutdownTask) Waiting as there are still 7 inflight and pending exchanges to complete, timeout in 599 seconds. Inflights per route: [route1 = 7]
2024-12-26 11:24:59,417 INFO  [org.apa.cam.imp.eng.DefaultShutdownStrategy] (Camel (camel-1) thread #4 - ShutdownTask) Waiting as there are still 7 inflight and pending exchanges to complete, timeout in 598 seconds. Inflights per route: [route1 = 7]
2024-12-26 11:25:00,417 INFO  [org.apa.cam.imp.eng.DefaultShutdownStrategy] (Camel (camel-1) thread #4 - ShutdownTask) Waiting as there are still 7 inflight and pending exchanges to complete, timeout in 597 seconds. Inflights per route: [route1 = 7]
2024-12-26 11:25:00,989 INFO  [route1] (executor-thread-2) Finished!!!
2024-12-26 11:25:01,418 INFO  [org.apa.cam.imp.eng.DefaultShutdownStrategy] (Camel (camel-1) thread #4 - ShutdownTask) Waiting as there are still 6 inflight and pending exchanges to complete, timeout in 596 seconds. Inflights per route: [route1 = 6]
2024-12-26 11:25:01,584 INFO  [route1] (executor-thread-3) Finished!!!
2024-12-26 11:25:02,086 INFO  [route1] (executor-thread-4) Finished!!!
2024-12-26 11:25:02,387 INFO  [route1] (executor-thread-5) Finished!!!
2024-12-26 11:25:02,419 INFO  [org.apa.cam.imp.eng.DefaultShutdownStrategy] (Camel (camel-1) thread #4 - ShutdownTask) Waiting as there are still 3 inflight and pending exchanges to complete, timeout in 595 seconds. Inflights per route: [route1 = 3]
2024-12-26 11:25:02,646 INFO  [route1] (executor-thread-6) Finished!!!
2024-12-26 11:25:02,965 INFO  [route1] (executor-thread-7) Finished!!!
2024-12-26 11:25:03,217 INFO  [route1] (executor-thread-8) Finished!!!
2024-12-26 11:25:03,423 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (main) Routes stopped (total:1)
2024-12-26 11:25:03,424 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (main)     Stopped route1 (rest://get:/demo)
2024-12-26 11:25:03,427 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (main) Apache Camel 4.8.0 (camel-1) shutdown in 6s17ms (uptime:3m49s)
2024-12-26 11:25:03,502 INFO  [io.quarkus] (main) camel-k-integration stopped in 6.114s

And here is a timeline for my test:

  • Submit a request (req1) to the application that will take 50s;
  • Wait 25s
  • Kill the application pod
  • Submit a second request (req2) to the application that will take 50s;
  • Queue-proxy receives the SIGTERM signal
  • Queue-proxy waits for 30s and terminates
  • Application receives the SIGTERM signal
  • Application waits for the requests to finish
  • req1 client gets a successful response
  • req2 client gets an EOF

skonto (Contributor) commented Dec 27, 2024

I was not able to reproduce it with the autoscale sample app and by following the steps above on the terminal. The second request always goes to the newly created pod.

I am not sure how the Camel K extension works under the hood and whether it shuts down the Netty stack as expected, but I see the warning Failed to register an accepted channel, which should be checked (see here), as it can be a race condition during server shutdown.
Connections are being shut down there, it seems (no event loop?); btw 127.0.0.1:39802 should be the QP.
See also this about Quarkus behavior with keep-alives; QP does set keepalives=true with HTTP/1, see here. So I suspect this could be the problem; could you ask there too?

This happens at 11:24:57,407, before the second request finishes and when SIGTERM is caught. The whole context finished in 6 seconds (the remaining requests are very close in time).
The fact that the executor threads sleep and finish their part does not mean that Camel K kept the connections open long enough (at least we need to verify that).

Could you turn on debug logging on the app side, preferably for the HTTP protocol layer, probably with something like (I haven't tested it) a Camel property, e.g. camel.debug.loggingLevel=DEBUG.

In addition, looking at your logs in #15649 (comment), the second request comes in before SIGTERM is received and QP enters draining mode. In that mode QP waits for all requests to finish (the pre-stop hook is called):

2024-12-16 11:34:16,721 INFO  [route1] (executor-thread-1) Sleeping for 50 seconds
...
{"severity":"INFO","timestamp":"2024-12-16T11:34:21.094294769Z","logger":"queueproxy","caller":"sharedmain/handlers.go:109","message":"Attached drain handler from user-container&{GET /wait-for-drain HTTP/1.1 1 1 map[Accept:[*/*] Accept-Encoding:[gzip] User-Agent:[kube-lifecycle/1.30+]] {} <nil> 0 [] false 10.10.78.191:8022 map[] map[] <nil> map[] 10.251.147.69:47034 /wait-for-drain <nil> <nil> <nil> 0xc0002242d0 0xc00007e6c0 [] map[]}","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-798c999c79-49t9g"}

At first glance there is no reason for the second request not to be handled by QP unless the app kills the connection.
Could you try in your env without a Camel K app?

lsergio (Author) commented Dec 30, 2024

Hi @skonto. I'll try it with a non-Camel K app. But anyway, I see you mentioned:

The second request always goes to the newly created pod.

But the problem is not actually related to requests that are started after the new pod is ready. The problem happens with requests that are being processed by a pod that is terminated: the QP waits for 30s and shuts down, even if the ongoing request has not finished yet.

Per my tests with the Camel K app, it looks like the application receives the SIGTERM signal only after the QP has fully terminated. So the application is still operating while the QP is handling the signal.

lsergio (Author) commented Dec 30, 2024

Here are some logs from a case where no request is being processed:

Queue-proxy log:

{"severity":"INFO","timestamp":"2024-12-30T11:28:40.616501786Z","logger":"queueproxy","caller":"sharedmain/handlers.go:109","message":"Attached drain handler from user-container&{GET /wait-for-drain HTTP/1.1 1 1 map[Accept:[*/*] Accept-Encoding:[gzip] User-Agent:[kube-lifecycle/1.30+]] {} <nil> 0 [] false 10.10.205.62:8022 map[] map[] <nil> map[] 10.251.255.215:49574 /wait-for-drain <nil> <nil> <nil> 0xc0001823c0 0xc00007f3e0 [] map[]}","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-6459b7cf-9rjhk"}
{"severity":"INFO","timestamp":"2024-12-30T11:28:40.632854852Z","logger":"queueproxy","caller":"sharedmain/main.go:305","message":"Received TERM signal, attempting to gracefully shutdown servers.","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-6459b7cf-9rjhk"}
{"severity":"INFO","timestamp":"2024-12-30T11:28:40.632910392Z","logger":"queueproxy","caller":"sharedmain/main.go:306","message":"Sleeping 30s to allow K8s propagation of non-ready state","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-6459b7cf-9rjhk"}
{"severity":"INFO","timestamp":"2024-12-30T11:29:10.616920068Z","logger":"queueproxy","caller":"sharedmain/main.go:310","message":"Shutting down server: main","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-6459b7cf-9rjhk"}
{"severity":"INFO","timestamp":"2024-12-30T11:29:10.617180611Z","logger":"queueproxy","caller":"sharedmain/main.go:310","message":"Shutting down server: admin","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-6459b7cf-9rjhk"}
{"severity":"INFO","timestamp":"2024-12-30T11:29:10.617247112Z","logger":"queueproxy","caller":"sharedmain/main.go:310","message":"Shutting down server: metrics","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-6459b7cf-9rjhk"}
{"severity":"INFO","timestamp":"2024-12-30T11:29:10.617334093Z","logger":"queueproxy","caller":"sharedmain/main.go:322","message":"Shutdown complete, exiting...","commit":"038a181","knative.dev/key":"mynamespace/rest-00001","knative.dev/pod":"rest-00001-deployment-6459b7cf-9rjhk"}

At 11:28:40 the queue proxy receives the TERM signal, sleeps for 30s and finishes shutting down at 11:29:10.

And this is the application log:

2024-12-26 11:45:25,362 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (main) Apache Camel 4.8.0 (camel-1) started in 82ms (build:0ms init:0ms start:82ms)
2024-12-26 11:45:25,759 INFO  [io.quarkus] (main) camel-k-integration 2.4.0 on JVM (powered by Quarkus 3.15.1) started in 7.840s. Listening on: http://0.0.0.0:8080
2024-12-26 11:45:25,759 INFO  [io.quarkus] (main) Profile prod activated. 
2024-12-26 11:45:25,760 INFO  [io.quarkus] (main) Installed features: [camel-attachments, camel-bean, camel-core, camel-k-core, camel-k-runtime, camel-kubernetes, camel-microprofile-health, camel-platform-http, camel-rest, camel-yaml-dsl, cdi, kubernetes-client, smallrye-context-propagation, smallrye-health, vertx]
2024-12-30 11:29:10,643 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (Shutdown thread) Apache Camel 4.8.0 (camel-1) is shutting down (timeout:10m0s)
2024-12-30 11:29:10,651 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (Shutdown thread) Routes stopped (total:1)
2024-12-30 11:29:10,652 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (Shutdown thread)     Stopped route1 (rest://get:/demo)
2024-12-30 11:29:10,656 INFO  [org.apa.cam.imp.eng.AbstractCamelContext] (Shutdown thread) Apache Camel 4.8.0 (camel-1) shutdown in 13ms (uptime:3d23h)

Shutdown starts at 11:29:10, right after the queue-proxy fully terminates.

If I repeat the tests and submit some long requests, I see the same behavior where the application only gets the termination signal after the QP is terminated. And the existing requests on the terminating pod that do not finish within the 30s timeframe will fail, even though the termination grace period is way longer.

skonto (Contributor) commented Dec 30, 2024

Hi @lsergio,

But the problem is not actually related to requests that are started after the new pod is ready.

I know; I am just reporting that the steps above are not enough to reproduce it. It probably requires testing via code to actually hit the terminated pod. However, even if you hit the terminated pod during draining, the timer will be reset and the request should be processed. QP will not terminate if there are pending requests and the connections are not broken.

In any case, if the Camel K application is shutting down the HTTP server early without gracefully draining the connections (it seems so, and there is an exception that needs debugging), as in the issue I pointed to, then you will see the behavior you observe. Please discuss this first at the issue I pointed to on the Camel project side, quarkusio/quarkus#18890 (comment). It seems there is a known issue there.

lsergio (Author) commented Jan 2, 2025

Hi @skonto. I created a Go app to test this scenario without Camel K, and indeed the queue-proxy waits for the requests to finish. I'll now figure out what's wrong with the Camel application.
Thanks for your support!

lsergio closed this as completed on Jan 2, 2025