@mjharriso had the idea of implementing a way to checkpoint the restart memory at regular intervals. Then, when the model crashes, an exception handler could access the saved memory and write out a restart.
@Hallberg-NOAA outlined a way that this could be done: make another instance of the restart_CS that, instead of containing pointers to model field arrays, contains pointers to separately allocated memory. The checkpointing routine would copy all of the latest data pointed to by the regular restart_CS into that allocated memory.
The checkpoint would be written out by calling into the regular MOM_restart interface using the checkpoint instance of the restart_CS.
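To make the proposal concrete, here is a minimal sketch of the shadow-copy pattern. It is written in Python rather than Fortran purely for illustration. Every name in it (`RestartRegistry`, `checkpoint`, `write_restart`, `run`) is hypothetical and none of it is part of the MOM_restart interface; the "restart" is just a dictionary, and the crash is a simulated exception rather than an MPI error.

```python
import copy


class RestartRegistry:
    """Hypothetical stand-in for restart_CS: maps field names to arrays."""

    def __init__(self):
        self.fields = {}

    def register(self, name, array):
        # Store a reference, not a copy, so the registry always sees live data.
        self.fields[name] = array


def checkpoint(live, saved):
    """Copy the current contents of every live field into memory owned by `saved`."""
    for name, array in live.fields.items():
        saved.fields[name] = copy.deepcopy(array)


def write_restart(registry):
    """Stand-in for the regular restart writer; returns a snapshot, writes no file."""
    return {name: list(array) for name, array in registry.fields.items()}


def run(nsteps, checkpoint_every, fail_at=None):
    temp = [10.0, 10.0, 10.0]  # a model field, updated in place
    live = RestartRegistry()   # points at the model field arrays
    live.register("temp", temp)
    saved = RestartRegistry()  # owns its own copies of the data
    checkpoint(live, saved)    # initial checkpoint before time stepping

    try:
        for step in range(1, nsteps + 1):
            for i in range(len(temp)):
                temp[i] += 1.0
            if step == fail_at:
                raise RuntimeError("simulated model crash")
            if step % checkpoint_every == 0:
                checkpoint(live, saved)
    except RuntimeError:
        # The live fields may be mid-step or corrupt, so write from the checkpoint.
        return write_restart(saved)
    return write_restart(live)
```

The same writer is called with either registry, which mirrors the idea of reusing the regular restart interface with the checkpoint instance. In this sketch, a crash at step 5 with a checkpoint every 2 steps yields the state saved at step 4, not the partially updated live state.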
What's not clear is how, or whether, the exception handler can get access to the checkpoint restart_CS and to the other program memory needed to dump a restart.
If we are going to write MPI exception handlers, it would also be worth adding something to dump a stack trace. For example, the Intel compilers provide tracebackqq().