Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Insufficient cores leading to a deadlock or application hang #229

Open
diegolovison opened this issue Jan 6, 2025 · 3 comments
Open

Insufficient cores leading to a deadlock or application hang #229

diegolovison opened this issue Jan 6, 2025 · 3 comments

Comments

@diegolovison
Copy link
Contributor

diegolovison commented Jan 6, 2025

The application utilizes a Semaphore referred to as blockingSemaphore to control a shared resource.

AbstractShell.java

if(seconds > 0){
  acquired = blockingSemaphore.tryAcquire(seconds,TimeUnit.SECONDS);
} else {
    blockingSemaphore.acquire();//released in the observer
}
blockingConsumer = (response) -> {
    blockingResponse.setLength(0);
    blockingResponse.append(response);
    blockingSemaphore.release();
};

The ScheduledThreadPoolExecutor is configured with a limited number of cores.
Previous code: new ScheduledThreadPoolExecutor(Runtime.getRuntime().availableProcessors()/2,new ThreadFactory() {
New code: new ScheduledThreadPoolExecutor(getMinimumScheduleCorePoolSize(),new ThreadFactory() {

private static int getMinimumScheduleCorePoolSize() {
    int cores = Runtime.getRuntime().availableProcessors() / 2;
    if (cores < 3) {
        return 3;
    } else {
        return cores;
    }
}

The blockingSemaphore.acquire() and blockingSemaphore.release() are both created by the ScheduledThreadPoolExecutor mentioned above.

A situation like the following will hang if corePoolSize of the ScheduledThreadPoolExecutor is equal to 2.

scripts:
  doit:
    - sh: date
      then:
        - set-state: RUN.date
  update:
    - sh: echo 'uno' >> /tmp/foo.txt
    - sleep: 2s
    - sh: echo 'dos' >> /tmp/foo.txt
    - sleep: 2s
    - sh: echo 'tres' >> /tmp/foo.txt
    - set-state: RUN.update true
  foo:
    - sleep: 15s
    - sh: tail -f /tmp/foo.txt
      watch:
        - regex: dos
          then:
            - script:
                name: doit
                async: true
        - regex: tres
          then:
            - ctrlC
    - set-state: RUN.foo true
hosts:
  local: "<Insert host value here>"
roles:
  doit:
    hosts: [local]
    run-scripts: [update, foo]

If the number of cores is equals 2 (new ScheduledThreadPoolExecutor(2,new ThreadFactory() {) and you run the YAML above, two threads will be in the blockingSemaphore.acquire() state. Because of that, there is no space for the third thread calls the blockingSemaphore.release() and the application will hang forever.

Increasing the number of cores (e.g., to 3) resolves the issue temporarily but introduces the potential for another scenario where three cores calls the blockingSemaphore.acquire(), leading to a hang. This approach merely postpones the issue without addressing its root cause.

@willr3
Copy link
Collaborator

willr3 commented Jan 8, 2025

I can find 3 areas where tasks are submitted to the ScheduledThreadPoolExecutor:

  1. The callback from SuffixStream. This occurs when qDup detects the prompt at the end of the stream from the remote shell. qDup uses a callback with a small delay because if there are additional characters from the remote stream then the prompt is not the end of the output from the current command. This delay catches cases like sh: echo "$PS1" && doSomething.sh The callback also runs the Sh.postRun() which calls AbstractShell.shSync to capture the exit code. The shSync call is what uses blockingSemaphore to ensure only one call to the remote shell at a time.
  2. The sleep command pushes a task into the queue to pass execution onto the next command in the script after the desired delay. This is done through the Context
  3. The timer on a command (usually on an sh) queues a task in the pool with the specified wait duration.

I think what you've uncovered is an error in the design. The callback from the suffix stream calling shSync means we need 2 free threads in the pool to complete the callback. 1 for running the callback with shSync and then the second for the other callback that identifies the prompt at the end of the shSync command.
I think we need to move the core callback that identifies the prompt at the end of ouptut into a separate thread pool and then execute the postRun portion of the callback in the pool with the user commads.

Does my rambling make sense or should I go back to staring at the code some more?

@willr3
Copy link
Collaborator

willr3 commented Jan 17, 2025

fda54c5
6b0a3f2

The above two commits move the SuffixStream callback to a dedicated threadpool which I believe will prevent shSync from causing thread starvation. Do you want to take a look at the changes and tell me if you think they are sufficient to prevent the deadlock?

@diegolovison
Copy link
Contributor Author

Hi Will,

It seems ok to me after your changes. I ran a test with #236
The issue is reproducible locally and on GHA when getMinimumScheduleCorePoolSize is equals 2.

  • When using the GHA the Runtime.getRuntime().availableProcessors() is 4. By dividing by 2, becomes 2.
  • Locally you can return a fixed value equals 2 for getMinimumScheduleCorePoolSize

IMHO, you can now create a new PR removing the method getMinimumScheduleCorePoolSize and if CI is happy ( I guess it will be because I tested ) we can close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants