Checkpoint restart memory in case of crash #180

Open
nichannah opened this issue May 21, 2015 · 1 comment
Comments

@nichannah
Collaborator

@mjharriso had the idea of implementing a way to checkpoint the restart memory at regular intervals. Then, when the model crashes, an exception handler can access the saved memory and write out a restart.

@Hallberg-NOAA outlined a way that this could be done: make another instance of the restart_CS which, instead of containing pointers to model field arrays, contains pointers to separately allocated memory. The checkpointing routine would then copy all the latest data pointed to by the regular restart_CS into that allocated memory.

The checkpoint would be written out by calling into the regular MOM_restart interface using the checkpoint instance of the restart_CS.
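As a rough illustration of the two-instance idea above, here is a minimal sketch in Python rather than the model's Fortran. Everything in it is hypothetical: `RestartCS` is a toy name-to-array registry standing in for the real restart_CS, and `write_restart` is a placeholder for the MOM_restart interface, not its actual signature.

```python
class RestartCS:
    """Hypothetical stand-in for a restart control structure: a name -> array registry."""

    def __init__(self):
        self.fields = {}

    def register(self, name, array):
        # Store a reference, not a copy, mirroring a pointer to a model field.
        self.fields[name] = array


def make_checkpoint_cs(live_cs):
    """Build a second registry whose entries point at separately allocated buffers."""
    shadow = RestartCS()
    for name, array in live_cs.fields.items():
        shadow.register(name, [0.0] * len(array))
    return shadow


def checkpoint(live_cs, shadow_cs):
    """Copy the latest live data into the preallocated shadow buffers."""
    for name, array in live_cs.fields.items():
        shadow_cs.fields[name][:] = array


def write_restart(cs):
    """Placeholder for the regular restart writer; it works on either registry."""
    return {name: list(array) for name, array in cs.fields.items()}
```

Because the writer only sees a registry of names and arrays, it does not need to know whether it was handed the live instance or the checkpoint instance. The cost is one extra copy of every restart field in memory, plus the copy time at each checkpoint.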

What's not clear is how, or whether, the exception handler can get access to the checkpoint restart_CS and the other program memory needed to dump a restart.
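One possible answer, sketched in Python rather than Fortran: an MPI error handler callback is only passed the communicator and the error code, so the checkpoint would probably have to live in module-level (global) state that the handler can reach without arguments. The names below (`_saved_checkpoint`, `save_checkpoint`, `crash_handler`, `run`) are all hypothetical, and a Python exception stands in for the crash.

```python
# Module-level slot: the only model state the handler can rely on reaching.
_saved_checkpoint = None


def save_checkpoint(fields):
    """Called regularly from the time-stepping loop; stores an independent copy."""
    global _saved_checkpoint
    _saved_checkpoint = {name: list(values) for name, values in fields.items()}


def crash_handler(write_restart):
    """Hypothetical handler: receives no model state, only a restart writer."""
    if _saved_checkpoint is None:
        return False  # nothing was checkpointed before the crash
    write_restart(_saved_checkpoint)
    return True


def run(fields, step, nsteps, write_restart):
    """Step the model, checkpointing after each step and dumping a restart on failure."""
    try:
        for _ in range(nsteps):
            step(fields)
            save_checkpoint(fields)
    except Exception:
        crash_handler(write_restart)
        raise
```

This does not settle whether the process is still in a usable state when the handler runs, for example after memory corruption or when other ranks have already aborted. It only shows how the handler could find the data.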

If we are going to write MPI exception handlers, it would also be worth adding something to dump a stack trace, e.g. the Intel compilers have tracebackqq().
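For illustration only, this is roughly what such a stack dump looks like when written as a small helper that a handler could call. It is a Python analogue using the standard `traceback` module, not the Intel routine, and `dump_stack_trace` is a hypothetical name.

```python
import sys
import traceback


def dump_stack_trace(stream=sys.stderr):
    """Write the current call stack to `stream` and return it as a string."""
    text = "".join(traceback.format_stack())
    stream.write(text)
    return text
```

The helper reports the stack at the point where it is called, so it is most useful when the handler runs on top of the failing call chain rather than after the stack has unwound.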

@nichannah
Collaborator Author

Perhaps another good thing to do within an MPI exception handler would be to dump the FP exception register.
