Skip to content
This repository has been archived by the owner on Dec 24, 2019. It is now read-only.

file/thread/socket descriptors leak #244

Open
mhow2 opened this issue May 27, 2019 · 27 comments
Open

file/thread/socket descriptors leak #244

mhow2 opened this issue May 27, 2019 · 27 comments
Assignees

Comments

@mhow2
Copy link

mhow2 commented May 27, 2019

As already mentioned in Slack in the general channel:

I'm running the regular docker stack with one slave.
Two analysis in // , year 2018, all metrics on projects https://github.com/INRIA/spoon/ and https://gitlab.ow2.org/clif/clif-legacy

Once started, I gather both MP's PIDs 10959 (w1 clif-legacy), 10985 (w2 spoon) and run:

watch 'sudo lsof -p 10959 -p 10985 |wc ; sudo lsof -p 10959 -p 10985 |grep sock |wc ; ps -o thcount 10959 10985'

So I get 1) the number of file descriptors (any kind) 2) the number of sockets descriptors 3) thread count on both.

After some minutes I observe that:

There are 2904 file descriptors opened for both PIDs , 1490 of those are sockets (not bound, not used). w1 gets 13007 threads and w2 381.

If I wait more, it will eventual reach 32000 threads, get java.lang.OutOfMemoryError: unable to create new native thread in the MP's logs and loose any control to the server, getting messages in active SSH session that the system cannot fork new processes.

@mhow2 mhow2 changed the title file (or socket) descriptors leak file/thread/socket descriptors leak May 27, 2019
blueoly added a commit that referenced this issue May 27, 2019
@blueoly
Copy link
Contributor

blueoly commented May 27, 2019

If my code was to blame for this issue, then aac36db should fix it. @mhow2 please check it.

@mhow2
Copy link
Author

mhow2 commented May 28, 2019

Unfortunately, it doesn't solve the issue. I'm still hitting several thousand of threads. Did you test it on your side ?

On my side, I did produce a stack trace please have a look.

@mhow2
Copy link
Author

mhow2 commented May 28, 2019

I just reproduced the issue with https://gitlab.ow2.org/clif/clif-legacy you can have a look to this project specifically when testing.

@creat89
Copy link
Contributor

creat89 commented May 28, 2019

I was able to reproduce the issue. Here is the things that I used to reproduce it:

  • Project: https://gitlab.ow2.org/clif/clif-legacy
  • Date range: 01/01/2018 - 31/12/2018
  • How was reproduced: Using Eclipse in Debug Mode (The next post will be in normal mode)
  • Metrics used: Only one, org.eclipse.scava.metricprovider.trans.plaintextprocessing.PlainTextProcessingTransMetricProvider as Martin indicated us that the metrics were not the issue.
  • Description:
    The issue started to become noticeable when the delta to analyze was the 15/03/2018. The logs shown by Eclipse showed that the process was on creating delta, however, Eclipse was quite slow, so the printing of the logs might have been late. I had to stop the process as I was running out of RAM, and the CPU was around 100% on a i7.
    I include the jfr file thas was generated using the JMC (you can open it with JMC). As you will observe I had to stop everything aroung 14:52.
    According to the Hot Packages, the issue was related to org.eclipse.jdt.internal.debug.core.model 1,858 77.611%

flight_recording_180172TheJVMRunningMissionControl_1.zip

EDIT: I'm running the task this time not in debug mode and the day of breaking has changed. I'm past over the 15/03/18

@creat89
Copy link
Contributor

creat89 commented May 28, 2019

I was able to reproduce the issue again, but this time in normal mode instead of debug mode. Here is the things that I used to reproduce it:

  • Project: https://gitlab.ow2.org/clif/clif-legacy
  • Date range: 01/01/2018 - 31/12/2018
  • How was reproduced: Using Eclipse
  • Metrics used: Only one, org.eclipse.scava.metricprovider.trans.plaintextprocessing.PlainTextProcessingTransMetricProvider as Martin indicated us that the metrics were not the issue.
  • Description:
    The issue started to become noticeable when the delta started to analyze the April 2018.
    I include the jfr file thas was generated using the JMC (you can open it with JMC).

flight_recording_180172TheJVMRunningMissionControl_2.zip

image

@blueoly
Copy link
Contributor

blueoly commented May 29, 2019

I took a look at the recordings that Adrian sent. The recordings do not show that there is an excessive use of memory. I could not come to a conclusion.

@creat89
Copy link
Contributor

creat89 commented May 29, 2019

I will test today with other readers. Just to know whether the issue comes from the platform or from gitlab reader.

@tdegueul
Copy link

Hi everyone,

After closer inspection, I may have found the problem based on two observations:

  • Thread explosion coincide with Delta creation
  • The problematic threads are of the form pool-X-thread-1, with X reaching the tens of thousands

This means that during Delta creation, the platform is creating thousands of Thread pools that all contain a single thread. Diving a bit into the code, I think the following components (bug tracking systems managers) are affected: BitbucketManager, GitLabManager, EclipseForumsManager.

For instance, have a look at https://github.com/crossminer/scava/blob/dev/metric-platform/platform-extensions/org.eclipse.scava.platform.bugtrackingsystem.gitlab/src/org/eclipse/scava/platform/bugtrackingsystem/gitlab/GitLabManager.java#L381-L382, which is invoked when creating the Deltas:

ScheduledExecutorService newClientService = Executors.newSingleThreadScheduledExecutor();
newClientService.scheduleAtFixedRate(runnable, 1, 1, TimeUnit.MINUTES);

This thread pool does not terminate. It creates a single-thread pool that indefinitely runs every minute. I am not sure why this is needed in the first place (is this some kind of 'retry' mechanism? why does it only start after 1 minute and not right away?). So we end up stacking pools of threads that are never closed.

This code is copy/pasted in other bug tracking managers. @creat89 do you think this might be the issue? If so, may you have a look and implement a mechanism to somehow shutdown the pool, or just remove it and do the same thing without relying on threads?

The GitHub manager doesn't reuse the same code so it shouldn't be affected. @mhow2 is it? If so, there might be another issue to fix.

There are other strange things going on in the profiler, but one step at a time ;)

@creat89
Copy link
Contributor

creat89 commented May 29, 2019

Hello @tdegueul, I cannot tell exactly as I'm not the developer of that code, it is @Danny2097. I agree that the thread should be terminated and yes, the goal is to wait until we can have more calls to do the requests. (@Danny2097?) The tests have been done using OW2's Gitlab server, which do not set any type of limit in the number of requests, thus, this shouldn't affect (I guess).
With regards GitHub, according to @mhow2, this has been observed also in the GitHub reader. But I haven't arrived to test enough GitHub (as I get an error but from RESTMULE). I'll try again today.

This morning I tested with Bugzilla and Redmine, and nothing similar was found. However, I was thinking that maybe the projects were small to reproduce the error. I tried to use the NTTP reader, but the channels that are large enough are taking too long even to create the first delta.

@tdegueul
Copy link

I'm not sure when the code is actually invoked.

The method public void setClient(GitLabTracker gitlabTracker) throws IOException { is kinda weird. The method recursively calls itself (?) through the use of threads (?) (ie. creates a Runnable that calls the method that creates the Runnable), and, if there is no "too many requests" (open_id.contains("Too Many Requests") == false), runs this Runnable periodically.

I have a very hard time understanding what it does, and indeed more tests are needed to understand if this is really what's happening. But I'm pretty sure there is something wrong here. Under certain circumstances (no idea which -- maybe when there's a "too many requests" happening), you may end up having thousands of threads triggered at a given moment, which will create thousands of threads in turn since they call themselves, etc.

@Danny2097
Copy link
Member

The idea behind the scheduled executer is to provide a mechanism to the readers which when the call counts reach zero or when a Oauth token expires it will pause until the either a new token is generated or the call counter resets.

I am open to other suggestions to implement this kind of functionality into the readers.

@Danny2097
Copy link
Member

Danny2097 commented May 29, 2019

Following discussions with @tdegueul and @creat89 I have removed the problematic code from the following readers gitlab bitbucket in commits eb50692, 8b23255, d68be58 . We are currently working on a fix for eclipse forums as this supports OAuth authentication.

@tdegueul
Copy link

It seems this was the issue and clif-legacy isn't running wild on my machine anymore... Please let us know @mhow2!

@mhow2
Copy link
Author

mhow2 commented May 31, 2019

Today I have deployed the new build and tried twice on the same project clif-legacy, whole 2018, all metrics.

The first time I got crazy a load average (like 100) so I stopped the platform.

Now that I'm running the second try, it is not "wild" but I can tell for sure there is still an issue somewhere. Currently at 63% of the analysis, there are about 4000 threads running on the server and around 1700 sockets opened by the related java process.

@creat89
Copy link
Contributor

creat89 commented May 31, 2019

@mhow2 Which metrics are you running? Are you using all the metrics?

@mhow2
Copy link
Author

mhow2 commented May 31, 2019

All metrics.
I got clif-legacy analysis to finish ! ... Well it's finished but I still see 2700 sockets open and about 5000 threads.

@creat89
Copy link
Contributor

creat89 commented Jun 1, 2019

hello @mhow2, I have pushed a commit that might be also related to this issue: feecc8d

Because the indexing metrics were being executed at the same time than the transient metrics, instead at the end. Le me know if you see any changes.

@blueoly
Copy link
Contributor

blueoly commented Jun 2, 2019

Hello. I just faced the following error, while I run all the metrics for a period for around a month at this repo https://github.com/ruby/ruby .

https://gist.github.com/blueoly/5d8d8d77ffe9b8a7e216a045dc74456c

@creat89
Copy link
Contributor

creat89 commented Jun 3, 2019

@blueoly, I had that issue but here crossminer/scava-deployment#63. I'm not sure it's related to this issue. I thought that it was related to using a local generated version of the metric platform.

@mhow2
Copy link
Author

mhow2 commented Jun 3, 2019

I'm just posting this two pics to illustrate the fact that the thread/memory allocation is growing even after the task (on clif-legacy) is reported as finished in the admin UI.
threads-pinpoint=1559283364,1559553796
memory-pinpoint=1559285956,1559546884

@blueoly
Copy link
Contributor

blueoly commented Jun 3, 2019

Hi @creat89 ,

I saw your issue but in my case I noticed a memory related warning before the error stack trace and this is the reason I put it here.

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000678f00000, 665321472, 0) failed; error='Cannot allocate memory' (errno=12)

I do not know if this is the case for you too.

@tdegueul
Copy link

tdegueul commented Jun 3, 2019

@mhow2 @blueoly If you're using the latest version of scava-deployment, the port 8192 should now be exposed by oss-app to enable remote debugging and profiling. Do you have public instances of Scava where the problem occurs and the port 8192 is exposed that I could connect to?

@blueoly
Copy link
Contributor

blueoly commented Jun 3, 2019

It was not a public instance. I will ask my admin to open that port and I will come back to you.

@tdegueul
Copy link

tdegueul commented Jun 3, 2019

I forgot: first, you need to update the -Djava.rmi.server.hostname=localhost in https://github.com/crossminer/scava-deployment/blob/dev/metric-platform/Dockerfile with your own public hostname.

@creat89
Copy link
Contributor

creat89 commented Jun 3, 2019

Yes, in my case it works. I guess I used the wrong port on Saturday

@mhow2
Copy link
Author

mhow2 commented Jun 5, 2019

So, deployed again this morning and running the analysis again on https://gitlab.ow2.org/clif/clif-legacy, all metrics, year 2018

The analysis hang at 7% on historic metric. JMX is available at scava-dev.ow2.org:8192

@mhow2
Copy link
Author

mhow2 commented Jun 5, 2019

So, deployed again this morning and running the analysis again on https://gitlab.ow2.org/clif/clif-legacy, all metrics, year 2018

The analysis hang at 7% on historic metric. JMX is available at scava-dev.ow2.org:8192

This observation doesn't seem related with the current leak issue. Please ignore it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants