file/thread/socket descriptors leak #244
Comments
Unfortunately, it doesn't solve the issue. I'm still hitting several thousand threads. Did you test it on your side? On my side, I produced a stack trace; please have a look. |
I just reproduced the issue with https://gitlab.ow2.org/clif/clif-legacy; you can have a look at this project specifically when testing. |
I was able to reproduce the issue. Here are the things I used to reproduce it:
flight_recording_180172TheJVMRunningMissionControl_1.zip EDIT: This time I'm running the task not in debug mode, and the day it breaks has changed. I'm now past 15/03/18. |
I was able to reproduce the issue again, but this time in normal mode instead of debug mode. Here are the things I used to reproduce it:
|
I took a look at the recordings that Adrian sent. The recordings do not show that there is an excessive use of memory. I could not come to a conclusion. |
I will test today with other readers, just to know whether the issue comes from the platform or from the GitLab reader. |
Hi everyone, After closer inspection, I may have found the problem based on two observations:
This means that during Delta creation, the platform is creating thousands of thread pools that all contain a single thread. Diving a bit into the code, I think the following components (bug tracking system managers) are affected. For instance, have a look at https://github.com/crossminer/scava/blob/dev/metric-platform/platform-extensions/org.eclipse.scava.platform.bugtrackingsystem.gitlab/src/org/eclipse/scava/platform/bugtrackingsystem/gitlab/GitLabManager.java#L381-L382, which is invoked when creating the Deltas:
This thread pool does not terminate: it creates a single-thread pool whose task runs every minute, indefinitely. I am not sure why this is needed in the first place (is this some kind of 'retry' mechanism? why does it only start after 1 minute and not right away?). So we end up stacking pools of threads that are never closed. This code is copy/pasted in other bug tracking managers. @creat89, do you think this might be the issue? If so, could you have a look and implement a mechanism to shut down the pool, or just remove it and do the same thing without relying on threads? The GitHub manager doesn't reuse the same code, so it shouldn't be affected. @mhow2, is it? If so, there might be another issue to fix. There are other strange things going on in the profiler, but one step at a time ;) |
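To illustrate the suspicion, here is a minimal sketch (class and method names are hypothetical, not taken from GitLabManager) of how a `ScheduledExecutorService` that is never shut down leaks one worker thread per call, matching the many single-thread pools visible in the recordings, together with one possible way to bound it:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExecutorLeakSketch {

    // Suspected pattern: a fresh single-thread scheduled pool is created for
    // every delta and nothing ever calls shutdown() on it, so each call leaks
    // one non-daemon worker thread that lives until the JVM exits.
    static void leakyWait() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(
                () -> { /* retry / wait-for-rate-limit logic would run here */ },
                1, 1, TimeUnit.MINUTES);
        // missing: executor.shutdown() once the wait condition is satisfied
    }

    // One way to keep the periodic check without the leak: shut the pool down
    // as soon as the condition it was waiting for is met.
    static void boundedWait(Runnable check) throws InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            executor.scheduleAtFixedRate(check, 1, 1, TimeUnit.MINUTES);
            // ... block or poll here until the reader may resume ...
        } finally {
            executor.shutdownNow();                         // cancel the periodic task
            executor.awaitTermination(5, TimeUnit.SECONDS); // let the worker thread exit
        }
    }
}
```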
Hello @tdegueul, I cannot tell exactly, as I'm not the developer of that code; it is @Danny2097. I agree that the thread should be terminated, and yes, the goal is to wait until we have more calls available to make the requests (@Danny2097?). The tests have been done using OW2's GitLab server, which does not set any limit on the number of requests, so this shouldn't be a factor (I guess). This morning I tested with Bugzilla and Redmine, and nothing similar was found. However, I was thinking that maybe the projects were too small to reproduce the error. I tried to use the NNTP reader, but the channels that are large enough take too long even to create the first delta. |
I'm not sure when the code is actually invoked. I have a very hard time understanding what the method does, and indeed more tests are needed to understand whether this is really what's happening. But I'm pretty sure something is wrong here. Under certain circumstances (no idea which; maybe when a "too many requests" error happens), you may end up with thousands of threads triggered at a given moment, which in turn create thousands of threads since they call themselves, etc. |
The idea behind the scheduled executor is to provide the readers with a mechanism that, when the call count reaches zero or an OAuth token expires, pauses them until either a new token is generated or the call counter resets. I am open to other suggestions for implementing this kind of functionality in the readers. |
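For the record, one possible alternative, sketched under the assumption that the reader can query its remaining call count and the rate-limit reset time (callsRemaining() and resetTime() below are hypothetical placeholders): block the reader's own thread until the window resets, so no extra executor or thread is ever created.

```java
import java.time.Duration;
import java.time.Instant;

public class RateLimitWaitSketch {

    // Hypothetical accessors; a real reader would take these values from the
    // server's rate-limit headers or from the OAuth token state.
    static int callsRemaining() { return 0; }
    static Instant resetTime() { return Instant.now().plusSeconds(60); }

    // Block the current (reader) thread until calls are available again.
    // Nothing is scheduled and no executor is created, so nothing can leak.
    static void waitForRateLimit() throws InterruptedException {
        while (callsRemaining() == 0) {
            long untilReset = Duration.between(Instant.now(), resetTime()).toMillis();
            Thread.sleep(Math.max(untilReset, 1_000));
        }
    }
}
```

If blocking the calling thread is not acceptable, another option would be a single shared scheduled executor owned by the platform and shut down with it, rather than a new pool per delta. |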
It seems this was the issue and |
Today I deployed the new build and tried twice on the same project. The first time I got a crazy load average (around 100), so I stopped the platform. Now that I'm running the second try, it is not "wild", but I can tell for sure there is still an issue somewhere. Currently, at 63% of the analysis, there are about 4000 threads running on the server and around 1700 sockets opened by the related java process. |
@mhow2 Which metrics are you running? Are you using all the metrics? |
All metrics. |
Hello. I just faced the following error while running all the metrics for a period of around a month on this repo: https://github.com/ruby/ruby . https://gist.github.com/blueoly/5d8d8d77ffe9b8a7e216a045dc74456c |
@blueoly, I had that issue too, but over in crossminer/scava-deployment#63. I'm not sure it's related to this issue. I thought it was related to using a locally generated version of the metric platform. |
Hi @creat89, I saw your issue, but in my case I noticed a memory-related warning before the error stack trace, and this is the reason I put it here: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000678f00000, 665321472, 0) failed; error='Cannot allocate memory' (errno=12) I do not know if this is the case for you too. |
It was not a public instance. I will ask my admin to open that port and I will come back to you. |
I forgot: first, you need to update the |
Yes, in my case it works. I guess I used the wrong port on Saturday. |
So, I deployed again this morning and ran the analysis again on https://gitlab.ow2.org/clif/clif-legacy, all metrics, year 2018. The analysis hangs at 7% on |
This observation doesn't seem related to the current leak issue. Please ignore it. |
As already mentioned in Slack in the general channel:
I'm running the regular docker stack with one slave.
Two analyses in parallel, year 2018, all metrics, on projects https://github.com/INRIA/spoon/ and https://gitlab.ow2.org/clif/clif-legacy
Once started, I gather both MPs' PIDs, 10959 (w1, clif-legacy) and 10985 (w2, spoon), and run:
watch 'sudo lsof -p 10959 -p 10985 |wc ; sudo lsof -p 10959 -p 10985 |grep sock |wc ; ps -o thcount 10959 10985'
So I get 1) the number of file descriptors (of any kind), 2) the number of socket descriptors, and 3) the thread count for both.
After some minutes I observe that:
There are 2904 file descriptors open across both PIDs; 1490 of those are sockets (not bound, not used). w1 has 13007 threads and w2 has 381.
If I wait longer, it will eventually reach 32000 threads, get
java.lang.OutOfMemoryError: unable to create new native thread
in the MP's logs, and I lose all control of the server, getting messages in the active SSH session that the system cannot fork new processes.