
Casper CESM Workflow

gdicker1 edited this page Jul 24, 2020 · 1 revision

Note that the USERNAME, the name of the CESM repository, and CASENAME must be decided by the user. REMOTE_NAME, GITHUB_LINK, and BRANCH_TO_CHECKOUT are decided either by the user or provided to them.


Setting up CESM

While logged into Casper, clone the repository of common files into your scratch directory. Cloning your CESM repository here as well is recommended.

cd /glade/scratch/{USERNAME}
git clone https://github.com/BrileyJ/MOM6_WorldShared.git

Clone CESM, checking out a specific branch, and then check out the external repositories.

git clone -b cesm2_2_alpha04d_mom6 https://github.com/ESCOMP/CESM.git
(cd CESM/; ./manage_externals/checkout_externals -o)

You may have to cancel partway through due to lack of access to a sub-repository of CAM.

Navigate to the mom directory and check out the most recent version of the master branch of MOM_Interface.

cd CESM/components/mom/
git checkout origin/master

If you run git log now, the history should include at least commit 26e798cc8b93378f8a4d5163adf4d6c1099ac488 and some newer commits.
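Instead of scanning the log by eye, the check can be scripted. The following is a minimal sketch, assuming a git version that supports merge-base --is-ancestor (1.8 or newer). The function name has_commit is hypothetical and is not part of CESM or MOM_Interface.

```shell
# has_commit COMMIT: succeed if COMMIT is HEAD itself or an ancestor of HEAD
# in the repository of the current directory. Fails (non-zero) if the
# commit is unknown to the repository.
has_commit() {
    git merge-base --is-ancestor "$1" HEAD
}

# Example (run from CESM/components/mom):
#   has_commit 26e798cc8b93378f8a4d5163adf4d6c1099ac488 && echo "commit present"
```

If the function fails, the checkout of origin/master most likely did not take effect and is worth repeating.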

If you are directly working on porting, this is the point where you are going to get your fork of the MOM6 repo and the branch you are working on. For other work, you should have some repository and branch name combination given to you.

cd MOM6
git remote add {REMOTE_NAME} {GITHUB_LINK}
git fetch {REMOTE_NAME}

The results from the fetch command should show the list of branches that you can now checkout locally.

git checkout {REMOTE_NAME}/{BRANCH_TO_CHECKOUT}

After the above checkout, you should now see the most recent commit made to the branch just checked out. Now go up three directories to be at your CESM directory, and then go into cime/scripts for further steps.

cd ../../..
cd cime/scripts

Now we need to update the CIME repository and branch being used.

git remote add gddWork https://github.com/gdicker1/cime.git
git fetch gddWork
git checkout gddWork/add_DAV

Case Creation & Setup

--- All steps after this point are case specific ---

Start off by creating a new case using the T62_t061 resolution, the CMOM component set, and the NTDD0002 project code.

./create_newcase --run-unsupported --res T62_t061 --compset CMOM --project NTDD0002 --case {CASENAME}
cd {CASENAME}

Copy over the PE layout file, which dictates how threads are allocated for any runs submitted. Then update XML variables to turn off a dependent job, reduce the number of file writes performed, and set the walltime for the job.

cp /glade/scratch/{USERNAME}/MOM6_WorldShared/MachPes/env_mach_1n_Casper ./env_mach_pes.xml
./xmlchange DOUT_S=False
./xmlchange OCN_DIAG_MODE=none
./xmlchange OCN_DIAG_SECTIONS=False
./xmlchange USER_REQUESTED_WALLTIME=01:30:00

Then, insert the line <arg flag="--reservation=TDD_4xV100"/> after line 48 in env_batch.xml. This sets us up to use a dedicated node with 4 NVIDIA V100 GPUs. Now set up the case with the changes we made.
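The edit to env_batch.xml can be made in a text editor or scripted. The following is a minimal sketch, assuming GNU sed (as found on typical Linux systems) and assuming that line 48 is the correct location in your copy of the file. Confirm the location by eye first, since the file layout may differ between CIME versions. The function name add_reservation_arg is hypothetical.

```shell
# add_reservation_arg FILE [LINE]: insert the reservation <arg> element
# after line LINE (default 48) of FILE, editing the file in place.
# Assumes GNU sed, which accepts the one-line form of the 'a' command.
add_reservation_arg() {
    file=$1
    line=${2:-48}
    sed -i "${line}a\\    <arg flag=\"--reservation=TDD_4xV100\"/>" "$file"
}

# Example (run from the case directory):
#   cp env_batch.xml env_batch.xml.bak   # keep a backup before editing
#   add_reservation_arg env_batch.xml
```

Keeping a backup copy makes it easy to undo the change if the element lands in the wrong place.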

./case.setup --reset

If you run ./preview_run at this point, look at the CASE INFO section to verify there is 1 node and 36 total tasks. Also look at the SUBMIT CMD section to confirm you are using reservation TDD_4xV100 and requesting 1.5 hours (01:30:00) of walltime.

Now build MOM6. This process can take about 5 to 10 minutes. It can be helpful to request an interactive session to reduce the load placed on the login node.

./case.build --skip-provenance-check

After the build finishes you can submit a run with

./case.submit

Other Useful Info

On Casper you can check a submitted job's status with squeue -u {USERNAME} and on Cheyenne with qstat -u {USERNAME}.

You can cancel running jobs with scancel {JOBNUM} or qdel {JOBNUM} depending on your system.

Whether you're on Cheyenne or Casper, you can check run.{CASENAME} in your case directory to see whether a run finished correctly. If you see resubmit_num 0 at the end and no ERROR message directing you to a log file, the run was most likely successful. You should also check cesm.log.{JOBNUM} and logfile.000000.out in your /glade/scratch/{USERNAME}/{CASENAME}/run directory. If the run was successful, there should be timing info at the bottom of each.
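The first of these checks can be sketched as a small shell function. This is only a sketch under stated assumptions: the function name run_looks_ok is hypothetical, the matched strings are taken literally from the description above, and the five-line window at the end of the file is an arbitrary choice.

```shell
# run_looks_ok FILE: succeed if FILE (e.g. run.{CASENAME}) contains no
# "ERROR" message and has "resubmit_num 0" within its last five lines.
run_looks_ok() {
    if grep -q 'ERROR' "$1"; then
        return 1
    fi
    tail -n 5 "$1" | grep -q 'resubmit_num 0'
}

# Example (run from the case directory):
#   run_looks_ok run.{CASENAME} && echo "run most likely successful"
```

A passing result only means the run was most likely successful. The log files in the run directory are still worth checking for timing info.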

Most relevant timing info can be found in the logfile.000000.out. Some extra info can be found in your case directory (the one with case.submit, case.build, etc in it) in the timing/cesm_timing.{CASENAME}.{JOBNUM} file.

You can also check the build log (bldlog) files in /glade/scratch/{USERNAME}/{CASENAME}/bld for info about how each part was compiled (especially the ocn.bldlog file). If a build is successful, the bldlog files will end in .gz, but can still be viewed with Vim.
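For quick inspection without opening an editor, a compressed log can also be streamed to the terminal. The following is a minimal sketch, assuming gzip is available. The function name show_log and the {BLDLOG_FILE} placeholder are hypothetical.

```shell
# show_log FILE: print FILE to standard output, decompressing it first
# when its name ends in .gz.
show_log() {
    case $1 in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Example: show the last 20 lines of a build log, compressed or not.
#   show_log {BLDLOG_FILE} | tail -n 20
```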

Within your case directory and the cime/scripts directory, the scripts there (e.g. xmlquery, create_newcase, case.setup) are well documented and can be run with the --help flag to learn more about them.

You can adjust the run length by modifying the STOP_OPTION and STOP_N XML variables in your case directory. Valid values for STOP_OPTION are [none, never, nsteps, nstep, nseconds, nsecond, nminutes, nminute, nhours, nhour, ndays, nday, nmonths, nmonth, nyears, nyear, date, ifdays0, end] and the default is ndays. STOP_N is an integer number of STOP_OPTION type events before a run ends.
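As an illustration, a value can be checked against the list above before it is passed to xmlchange. This is a sketch only: the function name is_valid_stop_option is hypothetical, and the accepted values are copied from the list above rather than queried from CIME.

```shell
# is_valid_stop_option VALUE: succeed if VALUE is one of the STOP_OPTION
# values listed above.
is_valid_stop_option() {
    case $1 in
        none|never|date|ifdays0|end) return 0 ;;
        nsteps|nstep|nseconds|nsecond|nminutes|nminute) return 0 ;;
        nhours|nhour|ndays|nday|nmonths|nmonth|nyears|nyear) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (run from the case directory): request a five-day run.
#   is_valid_stop_option ndays && ./xmlchange STOP_OPTION=ndays && ./xmlchange STOP_N=5
```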
