Adds ruby/dane #208 (Draft)

jasonb5 wants to merge 15 commits into main
Conversation

@jasonb5 commented Nov 8, 2024

Adds config/spack files for LLNL's ruby/dane machines.

@xylar (Collaborator) commented Nov 17, 2024

@jasonb5, this looks promising! I'll be making a new mache release (1.27.0) quite soon. Let me know if these additions are ready to be included in that and how you'd like me to review (since I don't have access to these machines).

Also, let me know if you need me to explain anything in mache, given that the documentation is so sparse. It seems like you've basically got it figured out but spack in particular can be pretty confusing and futzy.

@jasonb5 (Author) commented Nov 20, 2024

@xylar I'm hoping to get this wrapped up today, since I'll be out of the office until the new year. I'm just sorting out some machine-specific issues. I'm not sure how to proceed with reviewing; @chengzhuzhang is testing it locally on ruby, but dane is down at the moment. I don't think that's too much of a problem since the two machines have very similar environments.

@chengzhuzhang commented Nov 20, 2024

> I'm not sure how to proceed with reviewing

I think we can keep this PR open until we can test on dane. @xylar I think we may have another mache release concurrent with the February e3sm-unified release?

@chengzhuzhang

@jasonb5 I think you must have a ton of things to do on your first day back at work, but this one may need some attention because we also want to test the upcoming e3sm_unified on ruby and dane. I will follow up with an email listing the things that are currently not working well. Welcome back!

@xylar (Collaborator) commented Jan 16, 2025

@chengzhuzhang and @jasonb5, I'm wondering if the first attempt should just be like Polaris or Andes in that we don't try to build spack packages on these machines. The question is to what extent you all plan to use MPI-based software for Unified on these machines.

@chengzhuzhang commented Jan 16, 2025

@xylar, hey Xylar, there are currently two issues with Jason's e3sm-unified testing deployment that I noted; I'm copying content from my email to Jason:

Two remaining issues with using e3sm-unified on Dane:

1. One is with `ModuleNotFoundError: No module named 'ESMF'`;
2. The other is with running NCO with `mpi` (for which I have a reproducer; see the notes below).

Details for 1:

Traceback for the e3sm_to_cmip-related ESMF load error:

Traceback (most recent call last):
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/esmpy/interface/loadESMF.py", line 26, in <module>
    esmfmk = os.environ["ESMFMKFILE"]
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/os.py", line 680, in __getitem__
    raise KeyError(key) from None
KeyError: 'ESMFMKFILE'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/xesmf/util.py", line 8, in <module>
    import esmpy as ESMF
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/esmpy/__init__.py", line 108, in <module>
    from esmpy.api.esmpymanager import *
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/esmpy/api/esmpymanager.py", line 9, in <module>
    from esmpy.interface.cbindings import *
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/esmpy/interface/cbindings.py", line 13, in <module>
    from esmpy.interface.loadESMF import _ESMF
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/esmpy/interface/loadESMF.py", line 38, in <module>
    raise ImportError('The esmf.mk file cannot be found. Pass its path in the ESMFMKFILE environment variable.')
ImportError: The esmf.mk file cannot be found. Pass its path in the ESMFMKFILE environment variable.

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/bin/e3sm_to_cmip", line 6, in <module>
    from e3sm_to_cmip.__main__ import main
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 25, in <module>
    from e3sm_to_cmip.cmor_handlers.utils import (
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/e3sm_to_cmip/cmor_handlers/utils.py", line 15, in <module>
    from e3sm_to_cmip.cmor_handlers.handler import VarHandler
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/e3sm_to_cmip/cmor_handlers/handler.py", line 12, in <module>
    import xcdat as xc
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/xcdat/__init__.py", line 10, in <module>
    from xcdat.regridder.accessor import RegridderAccessor  # noqa: F401
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/xcdat/regridder/__init__.py", line 1, in <module>
    from xcdat.regridder.accessor import RegridderAccessor
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/xcdat/regridder/accessor.py", line 8, in <module>
    from xcdat.regridder import regrid2, xesmf, xgcm
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/xcdat/regridder/xesmf.py", line 4, in <module>
    import xesmf as xe
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/xesmf/__init__.py", line 3, in <module>
    from . import data, util
  File "/usr/WS1/e3sm/apps/e3sm-unified/base/envs/e3sm_unified_1.10.0_ruby/lib/python3.10/site-packages/xesmf/util.py", line 10, in <module>
    import ESMF
ModuleNotFoundError: No module named 'ESMF'

        srun: error: ruby826: task 0: Exited with exit code 1

Details for 2:

For now I'm still running ncclimo in background (non-mpi) mode as a workaround.

    To reproduce the error running `ncclimo` with mpi:

    salloc --nodes=1 --account=e3smtest --time=00:30:00

    source /usr/workspace/e3sm/apps/e3sm-unified/load_latest_e3sm_unified_ruby.sh

    ncclimo --case=extendedOutput.v3.LR.historical_0101 --jobs=4 --thr=1 --parallel=mpi --yr_srt=2000 --yr_end=2014 --input=/usr/workspace/e3sm/zhang40/simulations/extendedOutput.v3.LR.historical_0101/archive/atm/hist --map=/usr/workspace/e3sm/data/diagnostics/maps/map_ne30pg2_to_cmip6_180x360_traave.20231201.nc --output=trash --regrid=output --prc_typ=eam

    The error message: “/usr/WS1/e3sm/apps/e3sm-unified/spack/e3sm_unified_1_10_0_ruby_intel_mvapich2/var/spack/environments/e3sm_unified_1_10_0_ruby_intel_mvapich2/.spack-env/view/bin/ncclimo: line 3234: mpirun: command not found”

For NCO it is possible to avoid using mpi, though generating climatologies in background mode is pretty slow. I'm not sure how difficult it is to build with spack. If it turns out to be too time-consuming to figure out, I think we can live with the non-mpi version for this attempt.
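For reference, here is roughly what the background-mode fallback looks like, reusing the same arguments as the reproducer above (a sketch; `--parallel=bck` runs the per-variable tasks as background processes on the node rather than under MPI):

    source /usr/workspace/e3sm/apps/e3sm-unified/load_latest_e3sm_unified_ruby.sh

    # Same invocation as the MPI reproducer, but with --parallel=bck (no mpirun/srun needed)
    ncclimo --case=extendedOutput.v3.LR.historical_0101 --jobs=4 --thr=1 --parallel=bck \
        --yr_srt=2000 --yr_end=2014 \
        --input=/usr/workspace/e3sm/zhang40/simulations/extendedOutput.v3.LR.historical_0101/archive/atm/hist \
        --map=/usr/workspace/e3sm/data/diagnostics/maps/map_ne30pg2_to_cmip6_180x360_traave.20231201.nc \
        --output=trash --regrid=output --prc_typ=eam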

@xylar (Collaborator) commented Jan 17, 2025

@jasonb5 and @chengzhuzhang, regarding the ESMFMKFILE issue, this environment variable should be set when the load script gets sourced. @jasonb5, is it possible that e3sm_to_cmip is launching scripts that are going back to your base conda environment (e.g. because you have conda initialization in your .bashrc)?

Folks who reported this issue to us on conda-forge were typically not activating the conda environment properly:
conda-forge/esmf-feedstock#91
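A quick sanity check on one of these machines might look like this (a sketch; the load-script path is the one from the reproducer above, and the srun flags are illustrative):

    source /usr/workspace/e3sm/apps/e3sm-unified/load_latest_e3sm_unified_ruby.sh

    # Confirm the variable is set and which python this shell actually resolves to
    echo $ESMFMKFILE
    which python
    python -c "import esmpy; print(esmpy.__file__)"

    # Repeat inside a launched task to see whether it inherits the same environment
    srun -n 1 bash -c 'echo $ESMFMKFILE; which python'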

@xylar (Collaborator) commented Jan 17, 2025

Regarding NCO, yes, you will only be able to run it with system MPI if you build it with spack. So if you want ruby and dane to be fully supported machines, you'll have to continue to debug all the spack stuff.

I have had a great deal of trouble building NCO with intel in the past, which is why we use gnu on most (all?) currently supported machines. But you and Charlie Zender may be able to debug any build issues that come up.

@xylar (Collaborator) commented Jan 17, 2025

In general, I think I would need a lot more context to help you debug both the ESMF/ESMPy issue and the NCO issue. Did you deploy without spack and comment out the compiler for e3sm_unified in the config files here? Or is this with spack deployment and things went wrong? Could you provide the workflow you used?

@jasonb5 (Author) commented Jan 21, 2025

> @jasonb5 and @chengzhuzhang, regarding the ESMFMKFILE issue, this environment variable should be set when the load script gets sourced. @jasonb5, is it possible that e3sm_to_cmip is launching scripts that are going back to your base conda environment (e.g. because you have conda initialization in your .bashrc)?
>
> Folks who reported this issue to us on conda-forge were typically not activating the conda environment properly: conda-forge/esmf-feedstock#91

I don't believe either of the accounts (deployment or personal) have any conda initialization in their .bashrc.

> Regarding NCO, yes, you will only be able to run it with system MPI if you build it with spack. So if you want ruby and dane to be fully supported machines, you'll have to continue to debug all the spack stuff.
>
> I have had a great deal of trouble building NCO with intel in the past, which is why we use gnu on most (all?) currently supported machines. But you and Charlie Zender may be able to debug any build issues that come up.

NCO was built with spack using the system MPI (intel + mvapich2). Although the spack environment built all the packages, we do get a warning when activating the environment:

==> Warning: could not load runtime environment due to RuntimeError: Trying to source non-existing file: /usr/tce/packages/intel-classic/intel-classic-2021.6.0-magic/compilers_and_libraries/linux/bin/compilervars.sh

If there has been more luck using GNU + OpenMPI, I'm not opposed to adding this as an option.

The other potential cause of NCO not working correctly is that LC provides modules for compilers and MPI libraries but also provides system-specific wrapper modules which apply some "magic". Currently I'm using the "magic" modules, but I may try rebuilding everything using just the base modules.

@xylar (Collaborator) commented Jan 21, 2025

Over the weekend, I had trouble with ESMF installing python via Spack. I updated our spack fork to not do that anymore. I don't think that was the cause of your ESMF problem, but it might be worth trying a fresh spack build. You should be able to build using my branch of E3SM-Unified:
https://github.com/xylar/e3sm-unified/tree/update-to-1.11.0
You'll need to cherry-pick the update to mache 1.28.0 in this branch https://github.com/xylar/mache/tree/update-to-1.28.0 and then you should be able to run (from the E3SM-Unified branch):

cd e3sm_supported_machines
./deploy_e3sm_unified.py --tmpdir <some_scratch_dir> --conda ~/miniforge3 --mache_fork jasonb5/mache --mache_branch adds-lc-machines
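For the cherry-pick step, something along these lines should work (a sketch; it assumes you are on the adds-lc-machines branch of your mache fork, and `<commit>` is a placeholder for the commit that bumps mache to 1.28.0, which you would identify from the log):

    git remote add xylar https://github.com/xylar/mache.git
    git fetch xylar update-to-1.28.0
    git log --oneline xylar/update-to-1.28.0   # find the 1.28.0 bump commit
    git cherry-pick <commit>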

I'd maybe set up spack yaml/sh/csh files for Gnu and OpenMPI first if that's not too hard. If you can do that, and then provide me details about what goes wrong from there (logs of what spack spits out, etc.) I should be able to help more.

I would recommend trying to get E3SM-Unified 1.11.0 working over 1.10.0, just because there might be a lot of time sunk into troubleshooting the old Unified that then immediately gets replaced.

@jasonb5 (Author) commented Jan 21, 2025

I'll try getting E3SM-Unified 1.11.0 working. I'll also add GNU and OpenMPI; that shouldn't be too much of a problem.

Concerning NCO, I looked at the script and there are a lot of machine-specific adjustments to the environment to find the correct bin path. I'm guessing that not having these for dane/ruby may be why it cannot find mpirun. E3SM-Unified on a login node has mpirun under the conda environment; on a compute node it's under the spack environment. The system mvapich2 appears not to provide mpirun, which is where I think NCO may be looking.
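A rough way to confirm that (a sketch; the salloc flags are copied from the reproducer earlier in the thread):

    # On a login node, with E3SM-Unified loaded:
    source /usr/workspace/e3sm/apps/e3sm-unified/load_latest_e3sm_unified_ruby.sh
    which mpirun

    # Then inside an allocation, check what a launched task on the compute node sees:
    salloc --nodes=1 --account=e3smtest --time=00:30:00
    srun -n 1 which mpirun
    srun -n 1 bash -c 'echo $PATH'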

@chengzhuzhang

> Concerning NCO, I looked at the script and there are a lot of machine-specific adjustments to the environment to find the correct bin path. I'm guessing that not having these for dane/ruby may be why it cannot find mpirun. E3SM-Unified on a login node has mpirun under the conda environment; on a compute node it's under the spack environment. The system mvapich2 appears not to provide mpirun, which is where I think NCO may be looking.

@jasonb5 thank you for troubleshooting further on dane/ruby. Regarding adding machine-specific handling to NCO, I'm tagging @czender for advice.

@xylar (Collaborator) commented Jan 21, 2025

> Concerning NCO, I looked at the script and there are a lot of machine-specific adjustments to the environment to find the correct bin path. I'm guessing that not having these for dane/ruby may be why it cannot find mpirun. E3SM-Unified on a login node has mpirun under the conda environment; on a compute node it's under the spack environment. The system mvapich2 appears not to provide mpirun, which is where I think NCO may be looking.

Absolutely. If you want to make a PR to https://github.com/E3SM-Project/spack to also make it skip some of the madness for dane/ruby or just to clean up the madness more broadly, that would be welcome.

@xylar (Collaborator) commented Jan 21, 2025

@jasonb5, regarding ESMF, could you provide me with a small reproducer for that error? It could be a general problem and that environment variable may not be getting set properly in Spack.

@czender commented Jan 22, 2025

@jasonb5 @xylar and @chengzhuzhang Jason's understanding is correct. To elaborate further, NCO defaults to using mpirun -n ${mpi_number} for node management. A subset of specifically supported machines (e.g., Perlmutter) override this by automatically telling NCO to use srun -n $mpi_number. Internally NCO calls the generic command the MPI prefix which it stores in the $mpi_pfx shell variable. The user can explicitly set $mpi_pfx by invoking with the --mpi_pfx option as described here. Please try that first. I would also be happy to add additional automatically supported machines. I would need the $HOSTNAME values (need values for both login and compute nodes) for any new machines, and this would require a new NCO release. That in itself is not a problem, though the internal database of machine-specific $mpi_pfx values is currently about 10 machines long so I would ask that someone besides me test on the new machines since I do not scale well :) Either approach (invoking with --mpi_pfx=$mpi_pfx or me building the machine-specific command into the scripts) should solve the problem you experience on the new LLNL machines. Other approaches could also work. LMK how you would like to proceed.
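For what it's worth, a minimal sketch of that first option on a Slurm machine, using the map file from the reproducer earlier in the thread (in.nc and out.nc are hypothetical placeholder file names, not from this thread):

    # Tell ncremap to use srun instead of the default mpirun as its MPI prefix
    ncremap --mpi_pfx=srun \
        --map=/usr/workspace/e3sm/data/diagnostics/maps/map_ne30pg2_to_cmip6_180x360_traave.20231201.nc \
        in.nc out.nc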

@jasonb5 (Author) commented Jan 22, 2025

@czender Thank you, this is good to know. The current machines we're supporting are Slurm-based. I've manually added an override to test, and it's now moving past the original issue. Is there a way to change the invocation to use srun rather than mpirun? Or is this something that would need to be added to the internal database?

@czender commented Jan 22, 2025

You should be able to change from mpirun to srun with ncremap by invoking it as `ncremap --mpi_pfx=srun ...`. You can test that now. Unfortunately, I just realized that ncclimo does not support this flexibility because it lacks the --mpi_pfx option. It looks like getting ncclimo to use srun for a machine not in its internal database will require a new NCO release. I can change the default from mpirun to srun in NCO 4.3.2, now that all E3SM-supported machines use srun. That should work for ruby and dane, and would require only a small modification. If ruby and dane require special options to srun, then I would need to add them explicitly by name to the internal database, or implement full support for --mpi_pfx in ncclimo. In any case, you will need to figure out a way to test this since I do not have accounts on ruby and dane. How would you like me to proceed?

@czender commented Jan 23, 2025

The most recent NCO snapshot now defaults to managing nodes with srun instead of mpirun in both ncremap and ncclimo. This snapshot is installed in my bin directories on Perlmutter and Chrysalis. Only ncremap and ncclimo have changed, so maybe you can run a test that uses (copies of) those scripts and see if NCO with MPI works on ruby and dane.
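One possible way to run that test on ruby/dane (a sketch; `/path/to/copied/nco/scripts` is a placeholder for wherever the copied ncremap/ncclimo land, and the ncclimo arguments are the ones from the reproducer earlier in the thread):

    source /usr/workspace/e3sm/apps/e3sm-unified/load_latest_e3sm_unified_ruby.sh

    # Put copies of the updated ncremap/ncclimo ahead of the installed ones
    export PATH=/path/to/copied/nco/scripts:$PATH
    which ncclimo    # should now resolve to the copied snapshot script

    # Re-run the MPI reproducer from earlier in the thread
    salloc --nodes=1 --account=e3smtest --time=00:30:00
    ncclimo --case=extendedOutput.v3.LR.historical_0101 --jobs=4 --thr=1 --parallel=mpi \
        --yr_srt=2000 --yr_end=2014 \
        --input=/usr/workspace/e3sm/zhang40/simulations/extendedOutput.v3.LR.historical_0101/archive/atm/hist \
        --map=/usr/workspace/e3sm/data/diagnostics/maps/map_ne30pg2_to_cmip6_180x360_traave.20231201.nc \
        --output=trash --regrid=output --prc_typ=eam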

@jasonb5 (Author) commented Jan 27, 2025

> The most recent NCO snapshot now defaults to managing nodes with srun instead of mpirun in both ncremap and ncclimo. This snapshot is installed in my bin directories on Perlmutter and Chrysalis. Only ncremap and ncclimo have changed, so maybe you can run a test that uses (copies of) those scripts and see if NCO with MPI works on ruby and dane.

@czender This sounds like it should work for us; I'll grab the latest branch and test it out on ruby/dane. I don't think there's any other configuration required, so this might be all we need. I'll report back once I have a chance to test it.
