Analysis tasks being stuck in worker #338
Comments
Hi @borisbaldassari, I had the same issue (crossminer/scava-deployment#89). I was able to get tasks analysed by excluding some services from the docker compose and limiting the number of metrics. Please have a look at https://github.com/crossminer/scava-deployment/tree/dev#task-analysis; maybe it will also work for you.
Hi @valeriocos, thanks for the heads up! :-) Yeah, I've seen the issue. Looks like it's not exactly the same symptom: I still have the worker and tasks displayed in the UI, although the heartbeat is stale, and the errors you get when stopping the containers do not happen on my side. Plus, when I stop/start the containers it restarts the computation and then stops shortly afterwards -- which I don't think is your exact case either.
@valeriocos I definitely confirm that the task is stuck in the worker independently of the creation of a new task; I've not created a new task since this one was fired. The heartbeat has been stale for more than 7 hours now -- I've used all metrics, but on a limited time range (from 2018/01/01).
Hi @borisbaldassari, thank you for the update.
Hey @valeriocos, thanks for the feedback! ;-)
Hi guys, just let me add a small correction. |
@ALL: I think I found where and why the tasks get stuck, but I'll need some time today to come up with a fix. In the meantime, for those who experience the same issue, could you please run the following list of commands when a task gets stuck and copy/paste the result in this thread.
Hey @tdegueul, thanks for investigating. Hmm... When several workers (i.e. oss-app containers) are started, how do we know which one is w2, w3, or w4?
I observed on different Scava instances that the whole platform often gets stuck because of a 3a93b2b kills and retries
@borisbaldassari when you run the |
This is a workaround for #276, isn't it?
Indeed, I think both problems are due to the same cause; though there might be other deadlocks / stuck processes somewhere else. |
Indeed, I think this is not the only cause of stuck tasks. Here is a case I'm running into; see the logs:
In the other container's log, the following lines are interesting:
@mhow2 regarding your latest logs, looking at the Restmule code, this specific trace should not be blocking/lagging execution, but we will look into it further, for example by adding max retries to fixed-time retry codes.
@borisbaldassari, can you issue a 'ps axf' in the stalled workers as described in #276? This specific issue shouldn't happen since @tdegueul has implemented a workaround, but we never know.
I'm using a rather "old" build, but now that I look into my instance, I also have a task with its heartbeat stuck since 17/09/2019 15:55:09 on:
I'm attaching the container logs. I think the "leaked connection" thing has been tweaked by @kb634, so I need to retry with a newer build.
Update: still stuck after a few days. @mhow2, the output of ps axf is:
As for the logs (of one of the workers):
Well, I've just restarted one of the slaves which had stopped, and now all workers have vanished from the UI! They are still running (docker ps) but won't show up... Similar to crossminer/scava-deployment#95
The instance was updated on Friday and it seems the tasks are no longer frozen in the workers. Yeah! ;-) However, I can't add new tasks now; see issue #383 for more information. Since I'm not sure whether it's related or not, I'll keep this issue open until we know.
Just reporting that we got a similar error this weekend. The worker (only one was running) got stuck. This was our error:
@MarcioMateus, tomorrow I'll push the modifications done by York to prevent the GitHub issue. I just created the jars, but I need to test the readers before pushing the changes to Jenkins for building.
@MarcioMateus, commit 2aa0121 contains the newest version of Restmule in the GitHub reader, which should prevent the error from happening.
To clarify further, as we discussed with Davide in the call on the 16th of September, Restmule will retry up to 10 times (with a 1-minute delay each time) for certain HTTP codes (such as 502) and after that throw an exception (the one seen above) for its caller (in this case the Reader) to handle. As such, when using the latest version, if this message is seen, it means that Restmule is appropriately reporting a failure and expects to be terminated by its caller.
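For readers unfamiliar with this pattern, the behaviour described above amounts to a bounded, fixed-delay retry loop. The sketch below is only an illustration of that idea, not the actual Restmule code; the class, method, and exception names (`executeWithRetry`, `RetriesExhaustedException`, etc.) are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class BoundedRetrySketch {

    private static final int MAX_RETRIES = 10;                    // give up after 10 attempts
    private static final long DELAY_MINUTES = 1;                  // fixed delay between attempts
    private static final Set<Integer> RETRYABLE_CODES = Set.of(502, 503, 504);

    /** Runs the request, retrying retryable HTTP codes, then gives up and throws for the caller. */
    public Response executeWithRetry(Request request)
            throws RetriesExhaustedException, InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            Response response = send(request);                    // hypothetical HTTP call
            if (!RETRYABLE_CODES.contains(response.status())) {
                return response;                                  // success or non-retryable status
            }
            TimeUnit.MINUTES.sleep(DELAY_MINUTES);                // wait before the next attempt
        }
        // The caller (e.g. the GitHub reader) is expected to handle this.
        throw new RetriesExhaustedException("Gave up after " + MAX_RETRIES + " attempts");
    }

    private Response send(Request request) {
        return () -> 200;                                         // stub: always reports HTTP 200
    }

    // Placeholder types so the sketch is self-contained.
    interface Request {}
    interface Response { int status(); }
    static class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message) { super(message); }
    }
}
```

The key point is the final throw: once the retry budget is exhausted, the failure is surfaced to the caller rather than retried forever, which is why the reader needs to handle it.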
Thanks @creat89 and @kb634. So, is the "bug" fixed? Is the reader (or the metric-platform) handling the exception? |
@MarcioMateus I will need to catch the exception in the reader if Restmule fails to recover, but I need to check with Softeam which exception I could use to just pause the task for a while and continue working on other tasks.
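One possible shape for that handling, purely as a sketch: the reader catches the exhaustion exception and asks the platform to pause the task instead of letting the worker hang. The `TaskScheduler` abstraction, the `pauseTask` method, and the exception name below are assumptions for illustration, not the real metric-platform API.

```java
import java.time.Duration;

public class GitHubReaderSketch {

    private final TaskScheduler scheduler;   // hypothetical scheduler abstraction

    public GitHubReaderSketch(TaskScheduler scheduler) {
        this.scheduler = scheduler;
    }

    /** Reads the next delta; if retries are exhausted, park the task instead of blocking the worker. */
    public void readNextDelta(String taskId) {
        try {
            fetchFromGitHub(taskId);                       // may throw once all retries are used up
        } catch (RetriesExhaustedException e) {
            // Don't freeze the worker: pause this task and let the scheduler move on to other tasks.
            scheduler.pauseTask(taskId, Duration.ofMinutes(30));
        }
    }

    private void fetchFromGitHub(String taskId) throws RetriesExhaustedException {
        // Calls into the GitHub API via the reader; omitted in this sketch.
    }

    // Placeholder abstractions so the sketch is self-contained.
    interface TaskScheduler { void pauseTask(String taskId, Duration pause); }
    static class RetriesExhaustedException extends Exception {}
}
```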
When creating analysis tasks for projects, they seem to get stuck in the worker. The heartbeat is not updated and nothing happens (the percentage and the metrics processed don't change, etc.).
Stopping and restarting the oss-app container or the whole stack resumes the analysis, but it stops again shortly after (a few minutes).