Analysis tasks being stuck in worker #338
Comments
Hi @borisbaldassari, I had the same issue (crossminer/scava-deployment#89). I was able to get tasks analysed by excluding some services from the docker compose and limiting the number of metrics. Please have a look at https://github.com/crossminer/scava-deployment/tree/dev#task-analysis; maybe it will also work for you.
Hi @valeriocos, thanks for the heads up! :-) Yeah, I've seen the issue. Looks like it's not exactly the same symptom: I still have the worker and tasks displayed in the UI, although the heartbeat is stale, and the errors you get when stopping the containers do not happen on my side. Plus, when I stop/start the containers it restarts the computation and then stops shortly afterwards -- which I don't think is your exact case either.
@valeriocos I definitely confirm that the task is stuck in the worker independently of the creation of a new task; I've not created a new task since this one was fired. The heartbeat has been stale for more than 7 hours now -- I've used all metrics, but on a limited time range (from 2018/01/01).
Hi @borisbaldassari, thank you for the update.
Hey @valeriocos, thanks for the feedback! ;-)
Hi guys, just let me add a small correction. |
@ALL: I think I found where and why the tasks get stuck, but I'll need some time today to come up with a fix. In the meantime, for those who experience the same issue, could you please run the following list of commands when a task gets stuck and copy/paste the result in this thread.
Hey @tdegueul, thanks for investigating. Hmm... When several workers (i.e. oss-app containers) are started, how do we know which one is w2, w3, or w4?
I observed on different Scava instances that the whole platform often gets stuck because of a 3a93b2b kills and retries
@borisbaldassari when you run the |
This is a workaround for #276, isn't it?
Indeed, I think both problems are due to the same cause; though there might be other deadlocks / stuck processes somewhere else. |
Indeed, I think this is not the only cause of stuck tasks. Here is a case I'm running into; see the logs:
In the other container's log, the following lines are interesting:
@mhow2 regarding your latest logs, looking at the Restmule code, this specific trace should not be blocking/lagging execution, but we will look into it further, for example by adding max retries to fixed-time retry codes.
@borisbaldassari, can you issue a 'ps axf' in the stalled workers as described in #276? This specific issue shouldn't happen since @tdegueul has implemented a workaround, but we never know.
I'm using a rather "old" build, but now that I look into my instance, I also have a task with its heartbeat stuck since 17/09/2019 15:55:09 on:
I'm attaching the container logs. I think the "leaked connection" thing has been tweaked by @kb634, so I need to retry with a newer build.
Update: still stuck after a few days. @mhow2, the output of ps axf is:
As for the logs (of one of the workers):
Well, I've just restarted one of the slaves which had stopped, and now all workers have vanished from the UI! They are still running (docker ps) but won't show up... Similar to crossminer/scava-deployment#95
The instance was updated on Friday and it seems the tasks are no longer frozen in the workers. Yeah! ;-) However, I can't add new tasks now; see issue #383 for more information. Since I'm not sure whether it's related or not, I'll keep this issue open until we know.
Just reporting that we got a similar error this weekend. The worker (only one was running) got stuck. This was our error:
@MarcioMateus, tomorrow I'll push the modifications done by York to prevent the GitHub issue. I just created the jars, but I need to test the readers before pushing the changes to Jenkins for building.
@MarcioMateus, commit 2aa0121 contains the newest version of Restmule in the GitHub reader, which should prevent the error from happening.
To clarify further, as we discussed with Davide in the call on the 16th of September, Restmule will retry up to 10 times (with a 1-minute delay each time) for certain HTTP codes (such as 502) and after that throw an exception (the one seen above) for its caller (in this case the Reader) to handle. As such, when using the latest version, if this message is seen, it means that Restmule is appropriately reporting a failure and expects to be terminated by its caller.
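For readers unfamiliar with this pattern, the behaviour described above amounts to a bounded, fixed-delay retry loop. The sketch below is only an illustration of that idea, not the actual Restmule code; the class, method, and exception names (`executeWithRetry`, `RetriesExhaustedException`, etc.) are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class BoundedRetrySketch {

    private static final int MAX_RETRIES = 10;                    // give up after 10 attempts
    private static final long DELAY_MINUTES = 1;                  // fixed delay between attempts
    private static final Set<Integer> RETRYABLE_CODES = Set.of(502, 503, 504);

    /** Runs the request, retrying retryable HTTP codes, then gives up and throws for the caller. */
    public Response executeWithRetry(Request request)
            throws RetriesExhaustedException, InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            Response response = send(request);                    // hypothetical HTTP call
            if (!RETRYABLE_CODES.contains(response.status())) {
                return response;                                  // success or non-retryable status
            }
            TimeUnit.MINUTES.sleep(DELAY_MINUTES);                // wait before the next attempt
        }
        // The caller (e.g. the GitHub reader) is expected to handle this.
        throw new RetriesExhaustedException("Gave up after " + MAX_RETRIES + " attempts");
    }

    private Response send(Request request) {
        return () -> 200;                                         // stub: always reports HTTP 200
    }

    // Placeholder types so the sketch is self-contained.
    interface Request {}
    interface Response { int status(); }
    static class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message) { super(message); }
    }
}
```

The key point is the final throw: once the retry budget is exhausted, the failure is surfaced to the caller rather than retried forever, which is why the reader needs to handle it.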
Thanks @creat89 and @kb634. So, is the "bug" fixed? Is the reader (or the metric-platform) handling the exception? |
@MarcioMateus I will need to catch the exception in the reader if Restmule fails to recover, but I need to check with Softeam which exception I could use to just pause the task for a while and continue working on other tasks.
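One possible shape for that handling, purely as a sketch: the reader catches the exhaustion exception and asks the platform to pause the task instead of letting the worker hang. The `TaskScheduler` abstraction, the `pauseTask` method, and the exception name below are assumptions for illustration, not the real metric-platform API.

```java
import java.time.Duration;

public class GitHubReaderSketch {

    private final TaskScheduler scheduler;   // hypothetical scheduler abstraction

    public GitHubReaderSketch(TaskScheduler scheduler) {
        this.scheduler = scheduler;
    }

    /** Reads the next delta; if retries are exhausted, park the task instead of blocking the worker. */
    public void readNextDelta(String taskId) {
        try {
            fetchFromGitHub(taskId);                       // may throw once all retries are used up
        } catch (RetriesExhaustedException e) {
            // Don't freeze the worker: pause this task and let the scheduler move on to other tasks.
            scheduler.pauseTask(taskId, Duration.ofMinutes(30));
        }
    }

    private void fetchFromGitHub(String taskId) throws RetriesExhaustedException {
        // Calls into the GitHub API via the reader; omitted in this sketch.
    }

    // Placeholder abstractions so the sketch is self-contained.
    interface TaskScheduler { void pauseTask(String taskId, Duration pause); }
    static class RetriesExhaustedException extends Exception {}
}
```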
When creating analysis tasks for projects, they seem to get stuck in the worker. The heartbeat is not updated and nothing happens (the percentage and the metrics processed don't change, etc.).
Stopping and restarting the oss-app container or the whole stack resumes the analysis, but it stops again shortly after (a few minutes).