We are running simulations on a cluster using Ubuntu. For models with a larger number of cells (around 9 million), we consistently encounter the same issues (see below). After testing several models, we’ve concluded that if the number of cells per mesh stays below 300,000, the simulation runs without problems.
This becomes problematic for larger models, as they require partitioning into a large number of meshes. For a model with 9 million cells, we need around 30 meshes, or sometimes even more, to make the simulation run.
When monitoring the processes, we noticed a significant amount of waiting time rather than actual simulation activity, likely due to the high number of meshes.
Do you have any suggestions on how to address this issue?
Here is the earlier discussion, where we received different error messages but could not solve the problem either: #13614
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 354486 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 354487 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 11 PID 354488 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 24 PID 354489 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 25 PID 354490 RUNNING AT q03
= KILLED BY SIGNAL: 9 (Killed)
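A SIGKILL (signal 9) is sent to the process from outside; on a node that runs out of memory it is typically the Linux OOM killer. A quick way to check on the affected node (q03 in the messages above), sketched here assuming you can read the kernel log (may require root):

```shell
# Look for OOM-killer activity in the kernel log around the time of the crash.
dmesg -T 2>/dev/null | grep -i -E 'out of memory|oom-killer|killed process' \
  || echo "no OOM-killer entries found in dmesg"
```

If the OOM killer shows up here, the problem is memory pressure on that node rather than FDS itself.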
In the Discussion post that you cite, I describe how I ran your case on our system here at NIST. Is this the same input file? If not, post the input file and I will try it again.
Yes, this is the same file; however, the problem persists. I am not sure we are applying ulimit -s unlimited properly, or at least it does not solve the problem.
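A common pitfall is that ulimit -s unlimited set in an interactive shell on the head node never reaches the MPI ranks, because mpiexec starts non-interactive shells on the compute nodes. A small check script, meant to be run on every node (the hostnames are the ones from this thread; the limits.conf suggestion assumes PAM-based logins):

```shell
#!/bin/sh
# Print the stack limit as seen by a NON-interactive shell, which is what
# mpiexec-launched ranks actually inherit. Run via ssh on every compute node,
# e.g.:  ssh q03 'sh check_stack.sh'
limit=$(ulimit -s)
echo "stack limit on $(hostname): $limit"
if [ "$limit" != "unlimited" ]; then
  # To make it persistent for all sessions, add to /etc/security/limits.conf
  # on each node:
  #   *  soft  stack  unlimited
  #   *  hard  stack  unlimited
  echo "WARNING: stack is limited; ulimit -s unlimited is not taking effect" >&2
fi
```

If the script reports a finite limit on any compute node, the setting is not propagating to the ranks, regardless of what the head node shows.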
I ran this case again on a node of our cluster that is running Red Hat 9 Linux. The job uses about 9 GB of RAM. I am running with the latest source code. If you haven't already, check out the latest nightly build and try again.
We have 32 GB of RAM on our qmaster, q2, and q3 nodes, and 16 GB on q1.
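To see whether those limits are actually being hit, it may help to watch the resident memory of the FDS ranks on each node while the job runs. A rough sketch (the process name fds_openmp is taken from the tracebacks above):

```shell
#!/bin/sh
# Sum the resident set size (RSS) of all FDS ranks on this node, in GB.
# Run on each compute node while the simulation is active.
ps -C fds_openmp -o rss= \
  | awk '{ sum += $1 } END { printf "fds_openmp total RSS: %.1f GB\n", sum/1024/1024 }'
```

If the total on a node approaches its installed RAM, uneven mesh placement rather than total model size is the likely culprit.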
The ulimit command is already in place. We will install the latest FDS version and see whether the problem persists.
Here is the full log output from one of the failed runs:
2025-01-27 15:17:25 : MPI version: 3.1
2025-01-27 15:17:25 : MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS
2025-01-27 15:17:25 :
2025-01-27 15:17:25 :
2025-01-27 15:17:25 : Job TITLE :
2025-01-27 15:17:25 : Job ID string : s70_02
2025-01-27 15:17:25 :
2025-01-27 15:18:01 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
2025-01-27 15:18:01 : Image PC Routine Line Source
2025-01-27 15:18:01 : libc.so.6 00007AB8B6C42520 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 0000000007493899 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 000000000747D6D7 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 0000000007181782 Unknown Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 000000000040BA1D Unknown Unknown Unknown
2025-01-27 15:18:01 : libc.so.6 00007AB8B6C29D90 Unknown Unknown Unknown
2025-01-27 15:18:01 : libc.so.6 00007AB8B6C29E40 __libc_start_main Unknown Unknown
2025-01-27 15:18:01 : fds_openmp 000000000040B936 Unknown Unknown Unknown
2025-01-27 15:18:01 : Abort(457309071) on node 23 (rank 23 in comm 0): Fatal error in PMPI_Testall: Other MPI error, error stack:
2025-01-27 15:18:01 : PMPI_Testall(362)..............: MPI_Testall(count=4, req_array=0x7fff0383c410, flag=0x7fff0383c450, status_array=0x1) failed
2025-01-27 15:18:01 : MPIR_Testall_impl(44)..........:
2025-01-27 15:18:01 : MPIDI_Progress_test(95)........:
2025-01-27 15:18:01 : MPIDI_OFI_handle_cq_error(1100): OFI poll failed (ofi_events.c:1100:MPIDI_OFI_handle_cq_error:Input/output error)
2025-01-27 15:18:01 : Abort(994179983) on node 22 (rank 22 in comm 0): Fatal error in PMPI_Testall: Other MPI error, error stack:
2025-01-27 15:18:01 : PMPI_Testall(362)..............: MPI_Testall(count=6, req_array=0x7ffe0089a480, flag=0x7ffe0089a4d0, status_array=0x1) failed
2025-01-27 15:18:01 : MPIR_Testall_impl(44)..........:
2025-01-27 15:18:01 : MPIDI_Progress_test(95)........:
2025-01-27 15:18:01 : MPIDI_OFI_handle_cq_error(1100): OFI poll failed (ofi_events.c:1100:MPIDI_OFI_handle_cq_error:Input/output error)
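The OFI poll failure here is most likely a secondary symptom of another rank dying, but if it appears on its own, forcing Intel MPI onto a simpler fabric can help isolate network problems. These are standard Intel MPI / libfabric environment variables; the values are suggestions to experiment with, not a confirmed fix:

```shell
# Before launching FDS with mpiexec, try a simpler fabric configuration:
export FI_PROVIDER=tcp   # use the plain TCP libfabric provider
export I_MPI_DEBUG=5     # make Intel MPI print fabric and pinning info at startup
# ...then launch the job as usual and compare the failure mode.
```

If the crash signature changes (or disappears) under TCP, the original fabric or its driver is worth investigating.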
Or, from another run:
2025-01-27 15:10:29 : MPI version: 3.1
2025-01-27 15:10:29 : MPI library version: Intel(R) MPI Library 2021.6 for Linux* OS
2025-01-27 15:10:29 :
2025-01-27 15:10:29 :
2025-01-27 15:10:29 : Job TITLE :
2025-01-27 15:10:29 : Job ID string : s70_02
2025-01-27 15:10:29 :
2025-01-27 15:11:23 : Time Step: 1, Simulation Time: 0.09 s
2025-01-27 15:11:28 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
2025-01-27 15:11:28 : Image PC Routine Line Source
2025-01-27 15:11:28 : libc.so.6 0000749388042520 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 00000000073A5394 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000717EAE5 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040BA1D Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 0000749388029D90 Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 0000749388029E40 __libc_start_main Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040B936 Unknown Unknown Unknown
2025-01-27 15:11:28 : forrtl: severe (174): SIGSEGV, segmentation fault occurred
2025-01-27 15:11:28 : Image PC Routine Line Source
2025-01-27 15:11:28 : libc.so.6 000079DEF6642520 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 00000000073A5394 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000717EAE5 Unknown Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040BA1D Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 000079DEF6629D90 Unknown Unknown Unknown
2025-01-27 15:11:28 : libc.so.6 000079DEF6629E40 __libc_start_main Unknown Unknown
2025-01-27 15:11:28 : fds_openmp 000000000040B936 Unknown Unknown Unknown