Performance issues with large models in Ubuntu cluster simulations #14135

Open
quantumfds opened this issue Jan 29, 2025 · 5 comments

quantumfds commented Jan 29, 2025

Hello,

We are running simulations on a cluster using Ubuntu. For models with a larger number of cells (around 9 million), we consistently encounter the same issues (see below). After testing several models, we’ve concluded that if the number of cells per mesh stays below 300,000, the simulation runs without problems.

This becomes problematic for larger models, as they require partitioning into a large number of meshes. For a model with 9 million cells, we need around 30 meshes, or sometimes even more, to make the simulation run.
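To put a number on that, the arithmetic is simply the total cell count divided by the per-mesh limit we observed (a quick sketch; the 300,000-cell threshold is only our empirical finding, not a documented limit):

```bash
# Rough mesh count needed so that no mesh exceeds ~300,000 cells.
total_cells=9000000
max_cells_per_mesh=300000
echo $(( (total_cells + max_cells_per_mesh - 1) / max_cells_per_mesh ))   # prints 30
```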

When monitoring the processes, we noticed a significant amount of waiting time rather than actual simulation activity, likely due to the high number of meshes.
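For reference, this is roughly how we take that snapshot on a compute node (a minimal sketch; the process name fds_openmp is taken from the tracebacks below, everything else is our own setup):

```bash
# One-shot snapshot of the FDS ranks on this node: busy ranks show high
# %CPU, while ranks mostly waiting on MPI show low %CPU.
top -b -n 1 -u "$USER" | grep fds_openmp
```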

Do you have any suggestions on how to address this issue?

Here is the earlier discussion, where we received different error messages but could not solve the problem either: #13614

2025-01-27 15:17:25 : MPI version: 3.1
2025-01-27 15:17:25 : MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS
2025-01-27 15:17:25 :
2025-01-27 15:17:25 :
2025-01-27 15:17:25 : Job TITLE :
2025-01-27 15:17:25 : Job ID string : s70_02
2025-01-27 15:17:25 :
2025-01-27 15:18:01 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
2025-01-27 15:18:01 : Image PC Routine Line Source
2025-01-27 15:18:01 : libc.so.6 00007AB8B6C42520 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 0000000007493899 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 000000000747D6D7 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 0000000007181782 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 000000000040BA1D Unknown Unknown Unknown
2025-01-27 15:18:01 : libc.so.6 00007AB8B6C29D90 Unknown Unknown Unknown
2025-01-27 15:18:01 : libc.so.6 00007AB8B6C29E40 __libc_start_main Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 000000000040B936 Unknown Unknown Unknown
2025-01-27 15:18:01 : Abort(457309071) on node 23 (rank 23 in comm 0): Fatal error in PMPI_Testall: Other MPI error, error stack:
2025-01-27 15:18:01 : PMPI_Testall(362)..............: MPI_Testall(count=4, req_array=0x7fff0383c410, flag=0x7fff0383c450, status_array=0x1) failed
2025-01-27 15:18:01 : MPIR_Testall_impl(44)..........:
2025-01-27 15:18:01 : MPIDI_Progress_test(95)........:
2025-01-27 15:18:01 : MPIDI_OFI_handle_cq_error(1100): OFI poll failed (ofi_events.c:1100:MPIDI_OFI_handle_cq_error:Input/output error)
2025-01-27 15:18:01 : Abort(994179983) on node 22 (rank 22 in comm 0): Fatal error in PMPI_Testall: Other MPI error, error stack:
2025-01-27 15:18:01 : PMPI_Testall(362)..............: MPI_Testall(count=6, req_array=0x7ffe0089a480, flag=0x7ffe0089a4d0, status_array=0x1) failed
2025-01-27 15:18:01 : MPIR_Testall_impl(44)..........:
2025-01-27 15:18:01 : MPIDI_Progress_test(95)........:
2025-01-27 15:18:01 : MPIDI_OFI_handle_cq_error(1100): OFI poll failed (ofi_events.c:1100:MPIDI_OFI_handle_cq_error:Input/output error)

or:

2025-01-27 15:10:29 : MPI version: 3.1
2025-01-27 15:10:29 : MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS
2025-01-27 15:10:29 :
2025-01-27 15:10:29 :
2025-01-27 15:10:29 : Job TITLE :
2025-01-27 15:10:29 : Job ID string : s70_02
2025-01-27 15:10:29 :
2025-01-27 15:11:23 : Time Step: 1, Simulation Time: 0.09 s
2025-01-27 15:11:28 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
2025-01-27 15:11:28 : Image PC Routine Line Source
2025-01-27 15:11:28 : libc.so.6 0000749388042520 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 00000000073A5394 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000717EAE5 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040BA1D Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 0000749388029D90 Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 0000749388029E40 __libc_start_main Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040B936 Unknown Unknown Unknown
2025-01-27 15:11:28 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
2025-01-27 15:11:28 : Image PC Routine Line Source
2025-01-27 15:11:28 : libc.so.6 000079DEF6642520 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 00000000073A5394 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000717EAE5 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040BA1D Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 000079DEF6629D90 Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 000079DEF6629E40 __libc_start_main Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040B936 Unknown Unknown Unknown

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 354486 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 354487 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 11 PID 354488 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 24 PID 354489 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 25 PID 354490 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)

mcgratta self-assigned this Jan 29, 2025
mcgratta (Contributor) commented:

In the Discussion post that you cite, I describe how I ran your case on our system here at NIST. Is this the same input file? If not, post the input file and I will try it again.

quantumfds (Author) commented:

Yes, it is the same input file, but the problem persists. I am not sure we have applied ulimit -s unlimited properly; at any rate, it does not solve the problem.
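For context, this is the kind of check we run (a minimal sketch of our setup; the mpiexec invocation is only illustrative, and the rank count is arbitrary):

```bash
# Stack limit as set in the launching shell / job script.
ulimit -s unlimited
ulimit -s    # should print "unlimited"

# What the MPI ranks themselves see; ranks started on remote nodes do
# not necessarily inherit the limit of the launching shell.
mpiexec -n 4 bash -c 'echo "$(hostname): stack limit = $(ulimit -s)"'
```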

mcgratta (Contributor) commented:

How much memory (RAM) does this computer have?

mcgratta (Contributor) commented Feb 3, 2025

I ran this case again on a node of our cluster that runs Red Hat 9 Linux. The job uses about 9 GB of RAM. I am running the latest source code. If you haven't already, check out the latest nightly build and try again.

quantumfds (Author) commented:

We have 32 GB of RAM on our qmaster, q2, and q3, and 16 GB on q1.
The ulimit command is already in place. We will install the latest FDS version and see whether the problem persists.
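For reference, this is how we check each node before a run (hostnames are from our cluster; passwordless ssh between the nodes is assumed):

```bash
# Report free memory and the current stack limit on every node.
for host in qmaster q1 q2 q3; do
    ssh "$host" 'echo "== $(hostname) =="; free -g; echo "stack limit: $(ulimit -s)"'
done
```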
