file/thread/socket descriptors leak #244
Comments
Unfortunately, it doesn't solve the issue. I'm still hitting several thousand threads. Did you test it on your side? On my side, I produced a stack trace; please have a look. |
I just reproduced the issue with https://gitlab.ow2.org/clif/clif-legacy; you can have a look at this project specifically when testing. |
I was able to reproduce the issue. Here are the things I used to reproduce it:
flight_recording_180172TheJVMRunningMissionControl_1.zip EDIT: This time I'm running the task not in debug mode, and the day it breaks has changed. I'm now past 15/03/18. |
I was able to reproduce the issue again, but this time in normal mode instead of debug mode. Here are the things I used to reproduce it:
|
I took a look at the recordings that Adrian sent. The recordings do not show that there is an excessive use of memory. I could not come to a conclusion. |
I will test today with other readers, just to know whether the issue comes from the platform or from the GitLab reader. |
Hi everyone, After closer inspection, I may have found the problem based on two observations:
This means that during Delta creation, the platform is creating thousands of thread pools that all contain a single thread. Diving a bit into the code, I think the following components (bug tracking system managers) are affected. For instance, have a look at https://github.com/crossminer/scava/blob/dev/metric-platform/platform-extensions/org.eclipse.scava.platform.bugtrackingsystem.gitlab/src/org/eclipse/scava/platform/bugtrackingsystem/gitlab/GitLabManager.java#L381-L382, which is invoked when creating the Deltas:
This thread pool does not terminate: it creates a single-thread pool whose task runs every minute, indefinitely. I am not sure why this is needed in the first place (is this some kind of 'retry' mechanism? why does it only start after 1 minute and not right away?). So we end up stacking pools of threads that are never closed. This code is copy/pasted in other bug tracking managers. @creat89, do you think this might be the issue? If so, could you have a look and implement a mechanism to shut down the pool, or just remove it and do the same thing without relying on threads? The GitHub manager doesn't reuse the same code, so it shouldn't be affected. @mhow2, is it? If so, there might be another issue to fix. There are other strange things going on in the profiler, but one step at a time ;) |
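To illustrate the suspicion, here is a minimal sketch (class and method names are hypothetical, not taken from GitLabManager) of how a `ScheduledExecutorService` that is never shut down leaks one worker thread per call, matching the many single-thread pools visible in the recordings, together with one possible way to bound it:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExecutorLeakSketch {

    // Suspected pattern: a fresh single-thread scheduled pool is created for
    // every delta and nothing ever calls shutdown() on it, so each call leaks
    // one non-daemon worker thread that lives until the JVM exits.
    static void leakyWait() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(
                () -> { /* retry / wait-for-rate-limit logic would run here */ },
                1, 1, TimeUnit.MINUTES);
        // missing: executor.shutdown() once the wait condition is satisfied
    }

    // One way to keep the periodic check without the leak: shut the pool down
    // as soon as the condition it was waiting for is met.
    static void boundedWait(Runnable check) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            executor.scheduleAtFixedRate(check, 1, 1, TimeUnit.MINUTES);
            // ... block or poll here until the reader may resume ...
        } finally {
            executor.shutdownNow();                         // cancel the periodic task
            executor.awaitTermination(5, TimeUnit.SECONDS); // let the worker thread exit
        }
    }
}
```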
Hello @tdegueul, I cannot tell exactly, as I'm not the developer of that code; it is @Danny2097. I agree that the thread should be terminated, and yes, the goal is to wait until we have more calls available to make the requests (@Danny2097?). The tests have been done using OW2's GitLab server, which does not set any limit on the number of requests, so this shouldn't be a factor (I guess). This morning I tested with Bugzilla and Redmine, and nothing similar was found. However, I was thinking that maybe the projects were too small to reproduce the error. I tried to use the NNTP reader, but the channels that are large enough take too long even to create the first delta. |
I'm not sure when the code is actually invoked. I have a very hard time understanding what the method does, and indeed more tests are needed to understand whether this is really what's happening. But I'm pretty sure something is wrong here. Under certain circumstances (no idea which; maybe when a "too many requests" error happens), you may end up with thousands of threads triggered at a given moment, which in turn create thousands of threads since they call themselves, etc. |
The idea behind the scheduled executor is to provide the readers with a mechanism that, when the call count reaches zero or an OAuth token expires, pauses them until either a new token is generated or the call counter resets. I am open to other suggestions for implementing this kind of functionality in the readers. |
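For the record, one possible alternative, sketched under the assumption that the reader can query its remaining call count and the rate-limit reset time (callsRemaining() and resetTime() below are hypothetical placeholders): block the reader's own thread until the window resets, so no extra executor or thread is ever created.

```java
import java.time.Duration;
import java.time.Instant;

public class RateLimitWaitSketch {

    // Hypothetical accessors; a real reader would take these values from the
    // server's rate-limit headers or from the OAuth token state.
    static int callsRemaining() { return 0; }
    static Instant resetTime() { return Instant.now().plusSeconds(60); }

    // Block the current (reader) thread until calls are available again.
    // Nothing is scheduled and no executor is created, so nothing can leak.
    static void waitForRateLimit() throws InterruptedException {
        while (callsRemaining() == 0) {
            long untilReset = Duration.between(Instant.now(), resetTime()).toMillis();
            Thread.sleep(Math.max(untilReset, 1_000));
        }
    }
}
```

If blocking the calling thread is not acceptable, another option would be a single shared scheduled executor owned by the platform and shut down with it, rather than a new pool per delta. |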
It seems this was the issue and |
Today I deployed the new build and tried twice on the same project. The first time I got a crazy load average (around 100), so I stopped the platform. Now that I'm running the second try, it is not "wild", but I can tell for sure there is still an issue somewhere. Currently, at 63% of the analysis, there are about 4000 threads running on the server and around 1700 sockets opened by the related java process. |
@mhow2 Which metrics are you running? Are you using all the metrics? |
All metrics. |
Hello. I just faced the following error while running all the metrics for a period of around a month on this repo: https://github.com/ruby/ruby . https://gist.github.com/blueoly/5d8d8d77ffe9b8a7e216a045dc74456c |
@blueoly, I had that issue too, but over in crossminer/scava-deployment#63. I'm not sure it's related to this issue. I thought it was related to using a locally generated version of the metric platform. |
Hi @creat89, I saw your issue, but in my case I noticed a memory-related warning before the error stack trace, and this is the reason I put it here: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000678f00000, 665321472, 0) failed; error='Cannot allocate memory' (errno=12) I do not know if this is the case for you too. |
It was not a public instance. I will ask my admin to open that port and I will come back to you. |
I forgot: first, you need to update the |
Yes, in my case it works. I guess I used the wrong port on Saturday. |
So, I deployed again this morning and ran the analysis again on https://gitlab.ow2.org/clif/clif-legacy, all metrics, year 2018. The analysis hangs at 7% on |
This observation doesn't seem related to the current leak issue. Please ignore it. |
As already mentioned in Slack in the general channel:
I'm running the regular docker stack with one slave.
Two analyses in parallel, year 2018, all metrics, on projects https://github.com/INRIA/spoon/ and https://gitlab.ow2.org/clif/clif-legacy
Once started, I gather both MPs' PIDs, 10959 (w1, clif-legacy) and 10985 (w2, spoon), and run:
watch 'sudo lsof -p 10959 -p 10985 |wc ; sudo lsof -p 10959 -p 10985 |grep sock |wc ; ps -o thcount 10959 10985'
So I get 1) the number of file descriptors (of any kind), 2) the number of socket descriptors, and 3) the thread count for both.
After some minutes I observe that:
There are 2904 file descriptors open across both PIDs; 1490 of those are sockets (not bound, not used). w1 has 13007 threads and w2 has 381.
If I wait longer, it will eventually reach 32000 threads, get
java.lang.OutOfMemoryError: unable to create new native thread
in the MP's logs, and I lose all control of the server, getting messages in the active SSH session that the system cannot fork new processes.