
Fengming Yuan edited this page Mar 10, 2022 · 5 revisions

ELM (ALM formerly) on CADES, OR-CONDO: environments, building, and running

UPDATED: 2022-03-10 for the most recent E3SM master with modified user-defined PFT

**** NOTE: this is for CCSI users. ****

I. Environments

I-1. TWO shared spaces (CCSI users)

 /software/user_tools/current/cades-ccsi/

This is for group accessible tools.

Currently for PFLOTRAN, we have a gcc build of PETSc with parallel support (OpenMPI): PETSC_DIR=/software/user_tools/current/cades-ccsi/petsc4pf/openmpi-1.10-gcc-5.3 for version 3.8.x, or PETSC_DIR=/software/user_tools/current/cades-ccsi/petsc-x/openmpi-1.10-gcc-5.3 for version 3.9.x or above.

 /lustre/or-scratch/cades-ccsi/

This is the group's shared file space, referred to hereafter as $CCSI_DIR. Each user should have a folder under it.

LOGIN node: or-slurm-login.ornl.gov

i.e., access it via ssh or-slurm-login.ornl.gov

Note: ./proj-shared, ./world-shared, and ./scratch serve their usual purposes (project-shared, world-shared, and scratch space, respectively).

I-2. ENV suggestions

(1) It is suggested to edit your own .bashrc when you first log in.

(2) A suggested .bashrc for ELM is currently as follows:


# the following are with gcc-5.3.0, openmpi-1.10.3 (default)
module purge
module unload PE-intel
module unload PE-pgi
module unload PE-gnu
module load PE-gnu/1.0

module load boost
module load cmake
module load zlib

# needs blas-lapack
module load mkl/2018.1.163
export BLASLAPACK_LIBDIR=/software/dev_tools/swtree/or-condo/mkl/2018.1.163/centos7.5_binary/lib

# parallel-enabled hdf5 built with openmpi-1.10.3/gnu-5.3.0
module load hdf5-parallel/1.8.17

export HDF5_PATH=/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/hdf5-parallel/1.8.17/centos7.2_gnu5.3.0
export PATH=$HDF5_PATH/bin:$PATH
export LD_LIBRARY_PATH=$HDF5_PATH/lib:$LD_LIBRARY_PATH

# netcdf built with openmpi-1.10.3/gnu and support of hdf5 (i.e. netcdf-4)
module load netcdf-hdf5parallel/4.3.3.1
export NETCDF_PATH=/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/netcdf-hdf5parallel/4.3.3.1/centos7.2_gnu5.3.0

export PATH=$NETCDF_PATH/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF_PATH/lib:$LD_LIBRARY_PATH

module load pnetcdf
export PNETCDF_PATH=/software/dev_tools/swtree/cs400_centos7.2_pe2016-08/pnetcdf/1.9.0/centos7.2_gnu5.3.0
export PATH=$PNETCDF_PATH/bin:$PATH
export LD_LIBRARY_PATH=$PNETCDF_PATH/lib:$LD_LIBRARY_PATH

# PETSc / PFLOTRAN (set CLM_PFLOTRAN_SOURCE_DIR and PETSC_DIR first; see Section I-1)
export ELM_PFLOTRAN_SOURCE_DIR=$CLM_PFLOTRAN_SOURCE_DIR
export PETSC_PATH=$PETSC_DIR
export LD_LIBRARY_PATH=$PETSC_PATH/lib:$LD_LIBRARY_PATH

module load perl

# nco (the system module nco/4.6.9 has issues with ncap, so use the CCSI-built tools)
export CCSI_USERTOOLS=/software/user_tools/current/cades-ccsi
export UDUNITS2_PATH=$CCSI_USERTOOLS/udunits2
export PATH=$UDUNITS2_PATH/bin:$PATH
export GSL_ROOT=$CCSI_USERTOOLS/nco/gsl
export PATH=$GSL_ROOT/bin:$PATH
export ANTLR_ROOT=$CCSI_USERTOOLS/nco/antlr
export PATH=$ANTLR_ROOT/bin:$PATH
export NCOPATH=$CCSI_USERTOOLS/nco/nco-4.7.9
export PATH=$NCOPATH/bin:$PATH

# ncl
module load ncl
 
module rm python
module load python/3.6.3

# user-installed python packages (.local/lib/python3.6/site-packages/)
export PY_USER_PATH=~/.local/lib/python3.6/site-packages
export PYTHONPATH=$PY_USER_PATH:$PYTHONPATH

I-3. E3SM codes & input data

CADES machine settings

There is a specific branch of E3SM for our system: branch fmyuan/ornl-cades-machine-update of https://github.com/E3SM-Project/E3SM.git, tracking trunk E3SM.
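As a sketch, the branch can be checked out as follows (how external components are fetched depends on the E3SM version, so both variants are shown as comments):

```shell
# Clone the CADES-specific branch of E3SM named on this page
git clone -b fmyuan/ornl-cades-machine-update https://github.com/E3SM-Project/E3SM.git
cd E3SM
# Depending on the E3SM version, external components are fetched with
# either git submodules or the manage_externals tool:
git submodule update --init --recursive    # newer E3SM
# ./manage_externals/checkout_externals    # older E3SM
```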

E3SM input data on CADES

$CCSI_DIR/proj-shared/project_acme/e3sm-inputdata/

We try to keep a minimal but complete copy of the E3SM inputdata in this folder, because most CCSI users don't have access rights to the E3SM data repository.

Additionally, for site-level or project-specific purposes, the following site-specific input data may be used:
/nfs/data/ccsi/proj-shared/E3SM/inputdata

+++++++++++++++++++++++++++++++++++

II. CASE 1: single-point user-defined DATM-ELM (3 examples)

-----------------------------------------------------------------

NOTE: these examples use the ELM CUSTOMIZED configuration, with the CPL_BYPASS and user-defined PFT options.

NOTE: For this and the other examples to work, you need to clone the pt-e3sm-inputdata repo.

II-1. ELM only test, with user-defined DATM dataset

  • STEP 1. Create a PTCLM case
  1) cd $E3SM_MODEL_DIR/cime/scripts

  2) ./create_newcase --case /lustre/or-scratch/cades-ccsi/proj-shared/project_acme/cases/testELM_US-Brw_I1850CNPRDCTCBC --res ELM_USRDAT --mach cades --compiler gnu --compset I1850CNPRDCTCBC --project ccsi --walltime 00:30:00
  • STEP 2. Build the case created in STEP 1 (user-defined 1 point dataset, namely '1x1pt_US-Brw' from 1985-2015)
  3) cd /lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cases/testELM_US-Brw_I1850CNPRDCTCBC

  4) use ./xmlchange (or edit env_run.xml directly) to set:
             --id SAVE_TIMING --val FALSE
             --id DATM_MODE --val CLM1PT
             --id DATM_CLMNCEP_YR_START --val 1985
             --id DATM_CLMNCEP_YR_END --val 2015
             --id CLM_USRDAT_NAME --val 1x1pt_US-Brw
 
 ADDITIONALLY (not available in E3SM trunk),
 if doing an 'accelerated spinup', you need to edit 'env_run.xml' as follows:
             --id CLM_ACCELERATED_SPINUP --val on

 if doing a 'cold start', you need to edit 'env_run.xml' as follows:
             --id CLM_FORCE_COLDSTART --val on
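As a sketch, the edits in step 4 can be made from the case directory with ./xmlchange, using the values listed above (the spinup/cold-start flags only exist in our customized branch, not in E3SM trunk):

```shell
# equivalent xmlchange commands for the env_run.xml edits above
./xmlchange --id SAVE_TIMING --val FALSE
./xmlchange --id DATM_MODE --val CLM1PT
./xmlchange --id DATM_CLMNCEP_YR_START --val 1985
./xmlchange --id DATM_CLMNCEP_YR_END --val 2015
./xmlchange --id CLM_USRDAT_NAME --val 1x1pt_US-Brw
# optional, customized-branch-only flags:
# ./xmlchange --id CLM_ACCELERATED_SPINUP --val on
# ./xmlchange --id CLM_FORCE_COLDSTART --val on
```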


  5) ./case.setup 
(./case.setup --clean FOR cleaning-up)

  6) vi user_nl_datm
Edit this file, adding the following line at the end (NOTE: this works around a bug in the spinup setup inherited from CESM):
taxmode = "cycle", "extend", "extend"
(NOTE: the first "cycle" entry means the climate data are used repeatedly, as appropriate for COMPSET I1850XXX.)


  7) ./case.build
(./case.build --clean      #cleaning-up ALL ACME model components
 ./case.build --clean lnd  #cleaning-up LND component
 ./case.build --clean-all  #cleaning-up ALL model components and External components)
  • STEP 3. Run the case
  ./case.submit

NOTES: *(a) the run directory is: $CCSI_DIR/scratch/$USER/testELM_US-Brw_I1850CNPRDCTCBC/run

*(b) The default run length is 5 days, with monthly output, i.e. no real data are written out except restart files at the end.

*(c) If you'd like a different simulation length, edit env_run.xml, for example as follows:

  --id STOP_OPTION nyear
  --id STOP_N 600
 (These edits will let the model run for 600 years)
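As a sketch, the same edit from the case directory:

```shell
# run for 600 model years instead of the default 5 days
./xmlchange --id STOP_OPTION --val nyear
./xmlchange --id STOP_N --val 600
```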

------------------------------------------------------------------------

II-2. RUN with PFLOTRAN coupled (BGC only in this case)

  • (a) Additional OPTIONS for compiling the code coupled with PFLOTRAN

  • **edit env_run.xml as follows:

   --id CLM_INTERFACE_MODE --val pflotran 
  • **edit env_build.xml as follows:
   --id CLM_USE_PETSC --val TRUE 
  • **edit env_mach_specific.xml as follows:
  --id CLM_PFLOTRAN_COUPLED --val TRUE
  --id CLM_PFLOTRAN_COLMODE --val TRUE # only IF vertical-only transport/flow, e.g. a global simulation
  --id CLM_PFLOTRAN_SOURCE_DIR --val /lustre/or-hydra/cades-ccsi/proj-shared/models/pflotran-interface/src/clm-pflotran # modify as needed
  • (b) COMPILING: clean the build and compile the model as usual. But before this, make sure libpflotran.a has been compiled in CLM_PFLOTRAN_SOURCE_DIR, as instructed here.

  • (c) RUNNING, run the model as usual. For this case, edit env_run.xml to run 5 model-years:

  --id STOP_OPTION nyear
  --id STOP_N 5
  • NOTE: if you append use_pflotran = .false. and use_clm_interface = .false. to user_nl_clm, the model will run as if NOT coupled to PFLOTRAN, even though it is fully compiled with it.
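As a sketch, the coupling options above can also be set from the case directory with ./xmlchange (the source directory is the example value from above; adjust as needed):

```shell
# PFLOTRAN coupling switches; each lands in the file named in the list above
./xmlchange --id CLM_INTERFACE_MODE --val pflotran    # env_run.xml
./xmlchange --id CLM_USE_PETSC --val TRUE             # env_build.xml
./xmlchange --id CLM_PFLOTRAN_COUPLED --val TRUE      # env_mach_specific.xml
./xmlchange --id CLM_PFLOTRAN_COLMODE --val TRUE      # only if vertical-only transport/flow
./xmlchange --id CLM_PFLOTRAN_SOURCE_DIR --val /lustre/or-hydra/cades-ccsi/proj-shared/models/pflotran-interface/src/clm-pflotran
```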

--------------------------------------------------------------------------------

II-3. ELM only, with user-defined PFT, 1 point Grid-level GSWP3v1, 6 Ecotypes at NGEE-Arctic Kougarok Sites.

  1) cd $CCSI_DIR/proj-shared/models/e3sm/cime/scripts

  2) ./create_newcase --case /lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cases/ELMuserpft_Kougarok_I1850CNPRDCTCBC --res CLM_USRDAT --mach cades --compiler gnu --compset I1850CNPRDCTCBC --project ccsi --walltime 00:30:00
  • STEP 2. Configure case just created in STEP 1: (1) 1 grid-level atm data (1901-2010); (2) 6 user-defined point datasets for NGEE-Arctic Kougarok Sites; (3) user-defined Arctic PFTs.
  3) cd /lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cases/ELMuserpft_Kougarok_I1850CNPRDCTCBC

  4) use ./xmlchange (or edit env_run.xml directly) to modify the configuration as follows:
             --id SAVE_TIMING --val FALSE
             --id DATM_MODE --val CLMGSWP3v1
             --id DATM_CLMNCEP_YR_START --val 1901
             --id DATM_CLMNCEP_YR_END --val 2010
             --id CLM_USRDAT_NAME --val 1x1pt_kougarok-GRID

             --id ATM_DOMAIN_PATH --val /lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cesminput_ngee/share/domains/domain.clm
             --id LND_DOMAIN_PATH --val /lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cesminput_ngee/share/domains/domain.clm
             --id ATM_DOMAIN_FILE --val domain.lnd.51x63pt_kougarok-NGEE_TransA_navy.nc
             --id LND_DOMAIN_FILE --val domain.lnd.51x63pt_kougarok-NGEE_TransA_navy.nc
             --id DIN_LOC_ROOT_CLMFORC --val /lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cesminput_ngee/atm/datm7 

ADDITIONALLY (not available in E3SM trunk), if doing an 'accelerated spinup', you need to edit 'env_run.xml' as follows:

             --id CLM_ACCELERATED_SPINUP --val on

if doing a 'cold start', you need to edit 'env_run.xml' as follows:

             --id CLM_FORCE_COLDSTART --val on
  • STEP 3. Configure model run parallelism setting

By default, a model run on CADES uses a pre-configured PE layout. For example, the case as set up so far will run on 1 node with 1 MPI task. That is not ideal, so edit env_mach_pes.xml as follows so that the model runs on 1 node with 6 MPI tasks:

  5) vi env_mach_pes.xml

    <entry id="NTASKS">
      <type>integer</type>
      <values>
        <value compclass="ATM">6</value>
        <value compclass="CPL">6</value>
        <value compclass="OCN">6</value>
        <value compclass="WAV">6</value>
        <value compclass="GLC">6</value>
        <value compclass="ICE">6</value>
        <value compclass="ROF">6</value>
        <value compclass="LND">6</value>
        <value compclass="ESP">1</value>
      </values>
      <desc>number of tasks for each component</desc>
    </entry> 
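If your CIME version supports it, the same layout can usually be set with xmlchange instead of editing the XML by hand (a sketch; re-run the setup after changing the PE layout):

```shell
# set 6 MPI tasks for all components, then 1 for ESP, and redo the setup
./xmlchange NTASKS=6
./xmlchange NTASKS_ESP=1
./case.setup --reset
```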

NOTE: as created in STEP 1, the model will run on CADES for at most 30 minutes (wallclock time). For a longer simulation we need to modify that. So edit env_batch.xml under 'group id="case.run"', for example, to let the model run up to 48 hours (the maximum wallclock time currently allowed on CADES):

    <entry id="JOB_WALLCLOCK_TIME" value="48:00:00">
      <type>char</type>
      <valid_values/>
      <desc>The machine wallclock setting.  Default determined in config_machines.xml can be overwritten by testing</desc>
    </entry>
    <entry id="JOB_QUEUE" value="batch">
      <type>char</type>
      <valid_values/>
      <desc>The machine queue in which to submit the job.  Default determined in config_machines.xml can be overwritten by testing</desc>
    </entry>
    <entry id="USER_REQUESTED_WALLTIME" value="48:00:00">
      <type>char</type>
      <desc>Store user override for walltime</desc>
    </entry>
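These batch-job entries can usually also be changed with xmlchange's --subgroup option, assuming a CIME version that supports it (a sketch):

```shell
# raise the wallclock request for the case.run job to the 48-hour limit
./xmlchange --subgroup case.run JOB_WALLCLOCK_TIME=48:00:00
./xmlchange --subgroup case.run JOB_QUEUE=batch
```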

  • STEP 4. Build the case configured in STEP 2-3
  6) ./case.setup 
(./case.setup --clean FOR cleaning up the setup)

The model setup will not succeed at first: it fails with an error about a missing 'fsurdat' file and its associated PFT parameter file. So edit the file user_nl_clm by adding the following:

 fsurdat = '/lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cesminput_ngee/lnd/clm2/surfdata_map/surfdata_51x63pt_kougarok-NGEE_TransA_simyr1850_c181115-sub12.nc'
 paramfile = '/lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cesminput_ngee/lnd/clm2/paramdata/clm_params_c180524-sub12.nc'
 nyears_ad_carbon_only = 25
 spinup_mortality_factor = 10

NOTE: the last 2 options are optional, but they can considerably accelerate the spinup.

ONE more model edit is 'maxpft' (whose default is 17): vi env_run.xml and modify CLM_BLDNML_OPTS as follows:

    <entry id="CLM_BLDNML_OPTS" value="-maxpft 12 -bgc bgc -nutrient cnp -nutrient_comp_pathway rd  -soil_decomp ctc -methane -nitrif_denitrif ">
      <type>char</type>
      <desc>CLM build-namelist options</desc>
    </entry>

AND, finally, build the model as follows:

  7) ./case.build
(./case.build --clean      #cleaning-up ALL ACME model components
 ./case.build --clean lnd  #cleaning-up LND component
 ./case.build --clean-all  #cleaning-up ALL model components and External components)
  • STEP 5. Run the case
  ./case.submit

NOTES: *(a) the run directory is: $CCSI_DIR/scratch/$USER/ELMuserpft_Kougarok_I1850CNPRDCTCBC/run

*(b) The default run length is 5 days, with monthly output, i.e. no real data are written out except restart files at the end.

*(c) If you'd like a different simulation length, edit env_run.xml, for example as follows:

  --id STOP_OPTION nyear
  --id STOP_N 600
 (These edits will let the model run for 600 years)

NOTE: you probably need to modify JOB_WALLCLOCK_TIME in env_batch.xml, as mentioned above in STEP 3.

*(d) If you'd like a different interval for producing RESTART files, again edit env_run.xml, for example as follows:

  --id REST_OPTION nyear
  --id REST_N 100
 (These edits will make the model write restart files every 100 years)

*(e) If a run like the 600-year one above may not finish within the wallclock time, edit env_run.xml so that the model resumes from the LAST saved restart point, like the following:

  --id CONTINUE_RUN TRUE

TIPS: Since we configured the model to save 'restart' files every 100 years, it will resume from THE LAST save point. It is therefore critical to configure 'REST_N', 'STOP_N', and 'JOB_WALLCLOCK_TIME' properly, so that the run saves both I/O and run time; restart files are normally large, and writing them frequently is costly.
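Putting *(c)-(e) together, a typical long-spinup cycle looks like this sketch from the case directory:

```shell
# 600-year run with restart files written every 100 years
./xmlchange --id STOP_OPTION --val nyear
./xmlchange --id STOP_N --val 600
./xmlchange --id REST_OPTION --val nyear
./xmlchange --id REST_N --val 100
./case.submit
# if the job hits the wallclock limit, resume from the last restart and resubmit:
./xmlchange --id CONTINUE_RUN --val TRUE
./case.submit
```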

*(f) AFTER you're done with the ACCELERATED_SPINUP (aka ad_spinup) stage, you need to run the normal spinup, by editing env_run.xml as follows:

             --id CLM_ACCELERATED_SPINUP --val off
             --id CLM_FORCE_COLDSTART --val off

AND adding the initial state to user_nl_clm, as follows:

  finidat = './ELMuserpft_Kougarok_I1850CNPRDCTCBC.clm2.r.0601-01-01-00000.nc'

Here, we assume the file ELMuserpft_Kougarok_I1850CNPRDCTCBC.clm2.r.0601-01-01-00000.nc is in the directory: $CCSI_DIR/scratch/$USER/ELMuserpft_Kougarok_I1850CNPRDCTCBC/run.

THEN, issue the command ./case.submit as before. This will run the model for another 600 years as the normal spinup stage. The newly generated ELMuserpft_Kougarok_I1850CNPRDCTCBC.clm2.r.0601-01-01-00000.nc will be used as 'finidat' for the TRANSIENT run (1850 - 2010 in our case here).

  • STEP 6. Run the case for TRANSIENT stage (1850-2010)

THIS TIME you need to: (1) create a fresh case with --compset I20TRCNPRDCTCBC; (2) configure the case with --id STOP_N --val 160 (years).

AND add the initial state to user_nl_clm, as follows:

  finidat = '$CCSI_DIR/scratch/$USER/ELMuserpft_Kougarok_I1850CNPRDCTCBC/run/ELMuserpft_Kougarok_I1850CNPRDCTCBC.clm2.r.0601-01-01-00000.nc'

(NOTE: you MUST explicitly expand $CCSI_DIR and $USER into the real directory names; OR copy that finidat file into your transient run directory.)
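One way to avoid hand-expanding the variables is to let the shell do it when appending the line to user_nl_clm; a minimal sketch (the path follows the example above, with $CCSI_DIR as defined in Section I-1):

```shell
# let the shell expand $CCSI_DIR and $USER before the line is written out
CCSI_DIR=/lustre/or-scratch/cades-ccsi
echo "finidat = '$CCSI_DIR/scratch/$USER/ELMuserpft_Kougarok_I1850CNPRDCTCBC/run/ELMuserpft_Kougarok_I1850CNPRDCTCBC.clm2.r.0601-01-01-00000.nc'" >> user_nl_clm
tail -n 1 user_nl_clm   # check that the fully expanded path was written
```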

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

III. CASE 2: a global case of datm-clm for running 5 days

  • STEP 1. Create a case
cd $CCSI_DIR/proj-shared/models/ACME-fmyuan/cime/scripts/

./create_newcase --case $CCSI_DIR/proj-shared/project_acme/cases/testalm_hcru_I1850CLM45CN --res hcru_hcru --mach cades --compiler gnu --mpilib openmpi --compset I1850CLM45CN --walltime 1:00:00 --project ccsi

This command will set up a case in the following directory:

$CCSI_DIR/proj-shared/project_acme/cases/testalm_hcru_I1850CLM45CN

If you want to change this case name and directory, modify the command option above following “--case”. Usually you should put this in your home directory, like: --case $HOME/project_acme/cases/????

This command will also create a model build directory and run directory under:

$CCSI_DIR/scratch/$USER/testalm_hcru_I1850CLM45CN

That is, your executable and libraries are in its ./bld subfolder, and the model run directory is its ./run subfolder.

  • STEP 2. Build a case, under the case directory.
cd $CCSI_DIR/proj-shared/project_acme/cases/testalm_hcru_I1850CLM45CN

./xmlchange (or edit env_run.xml directly):
             --id STOP_N --val 1
             --id STOP_OPTION --val ndays
(Edit AS NEEDED; for example, the above makes the model run for 1 day (STOP_N = 1, STOP_OPTION = ndays).)

./case.setup

./case.build

This will build the model in your bld directory, as mentioned at the end of STEP 1. You can modify that location in ‘env_build.xml’, via the EXEROOT or CESMSCRATCHROOT entries.

  • STEP 3. Run case, under the case directory.

TIP: if you don't want to write out CLM netcdf files (output is usually slow on HPC, which is not good for code testing), you may want to turn off all output by adding a line of text to user_nl_clm:

  hist_empty_htapes = .true.

Then, run model by issuing command:

./case.submit

It will run the model under the scratch root for the compute nodes. In this case that is:

$CCSI_DIR/scratch/$USER/testalm_hcru_I1850CLM45CN/run

FINALLY, to use a user-defined LULC, add it to user_nl_clm as follows:

 flanduse_timeseries = '/lustre/or-hydra/cades-ccsi/proj-shared/project_acme/cesminput_ngee/lnd/clm2/surfdata_map/landuse.timeseries_51x63pt_kougarok-NGEE_TransA_simyr1850_c181115-sub12.nc'

+++++++++++++++++++++++++++++++++++++++++++++++

IV. DEBUGGING

DDT Debug on CADES
