Adds ruby/dane #208
base: main
Conversation
@jasonb5, this looks promising! I'll be making a new mache release (1.27.0) quite soon. Let me know if these additions are ready to be included in that and how you'd like me to review (since I don't have access to these machines). Also, let me know if you need me to explain anything in mache, given that the documentation is so sparse. It seems like you've basically got it figured out but spack in particular can be pretty confusing and futzy.
@xylar I'm hoping to get this wrapped up today; I'll be out of office until the new year. I'm just sorting out some machine-specific issues. I'm not sure how to proceed with reviewing; @chengzhuzhang is testing it locally on ruby, but dane is down at the moment. I don't think that's too much of a problem since the machines have very similar environments.
I think we can keep this PR open until we can test on dane.
@jasonb5 I think you must have a ton of things on your plate on your first day back at work, but this one may need some attention because we also want to test the upcoming e3sm_unified on ruby and dane. I will follow up with an email listing the things that are currently not working well. Welcome back!
@chengzhuzhang and @jasonb5, I'm wondering if the first attempt should just be like Polaris or Andes in that we don't try to build spack packages on these machines. The question is to what extent you all plan to use MPI-based software for Unified on these machines. |
@xylar, hey Xylar, there are currently two issues with Jason's e3sm-unified testing deployment that I noted; I'm copying content from my email to Jason:
Details for 1:
Details for 2:
@jasonb5 and @chengzhuzhang, regarding the ESMF/ESMPy issue: folks who reported this issue to us on conda-forge were typically not activating the conda environment properly.
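Proper activation generally means going through conda's shell hook (or the E3SM-Unified load script) so that each package's activation scripts run, rather than manually prepending the environment's bin directory to PATH. A minimal sketch, with placeholder paths and environment name since the actual install location on ruby/dane is not shown here:

```bash
# Minimal sketch, not the literal ruby/dane setup: activate through conda's
# shell hook so package activation scripts (e.g. the one that sets ESMFMKFILE
# for ESMPy) actually run.
source /path/to/miniconda3/etc/profile.d/conda.sh   # placeholder install path
conda activate e3sm_unified_1.10.0                  # placeholder env name
```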
Regarding NCO, yes, you will only be able to run it with system MPI if you build it with spack. So if you want ruby and dane to be fully supported machines, you'll have to continue to debug all the spack stuff. I have had a great deal of trouble building NCO with intel in the past, which is why we use gnu on most (all?) currently supported machines. But you and Charlie Zender may be able to debug any build issues that come up.
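As a rough illustration of the compiler/MPI switch being suggested (the specs below are assumptions; in practice mache generates a full spack environment rather than installing packages ad hoc):

```bash
# Sketch only: see how NCO concretizes against gcc + openmpi instead of
# intel + mpich, then build it. Exact versions and variants depend on the
# mache-generated environment for ruby/dane.
spack spec nco %gcc ^openmpi
spack install nco %gcc ^openmpi
```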
In general, I think I would need a lot more context to help you debug both the ESMF/ESMPy issue and the NCO issue. Did you deploy without spack and comment out the compiler for e3sm_unified in the config files here? Or is this a spack deployment where things went wrong? Could you provide the workflow you used?
I don't believe either of the accounts (deployment or personal) has any conda initialization in their shell startup files.
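A quick way to double-check, assuming bash is the login shell (the list of files is a guess at the usual suspects):

```bash
# Look for the block that `conda init` writes into shell startup files; if
# present, it can interfere with the intended E3SM-Unified activation.
grep -n "conda initialize" ~/.bashrc ~/.bash_profile ~/.profile 2>/dev/null
```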
NCO was built with spack using the system MPI (intel + mpich2). Although the spack environment built all the packages, we do get an error activating the environment.
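For context, a minimal sketch of activating a directory-based spack environment, assuming that's how the ruby/dane environment is laid out (the path below is a placeholder, and spack's setup-env.sh is assumed to be sourced already):

```bash
# Sketch: activate a spack environment stored in a directory, then confirm
# which environment is active and what it installed.
spack env activate -d /path/to/e3sm_unified_spack_env   # placeholder path
spack env status
spack find
```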
If there has been more luck using GNU + OpenMPI, I'm not opposed to adding that as an option. The other potential cause of NCO not working correctly is that LC provides modules for compilers and MPI libraries but also provides system-specific wrapper modules that apply some "magic". Currently I'm using the "magic" modules, but I may try rebuilding everything using just the base modules.
Over the weekend, I had trouble with ESMF installing python via Spack. I updated our spack fork to not do that anymore. I don't think that was the cause of your ESMF problem but it might be worth trying a fresh spack build. You should be able to build using my branch of E3SM:
I'd maybe set up spack yaml/sh/csh files for Gnu and OpenMPI first, if that's not too hard. If you can do that and then provide me with details about what goes wrong from there (logs of what spack spits out, etc.), I should be able to help more. I would recommend trying to get E3SM-Unified 1.11.0 working rather than 1.10.0, just because there might be a lot of time sunk into troubleshooting the old Unified that then immediately gets replaced.
I'll try getting E3SM-Unified 1.11.0 working. I'll also add GNU and OpenMPI; that shouldn't be too much of a problem. Concerning NCO, I looked at the script and there are a lot of machine-specific adjustments to the environment to find the correct bin path. I'm guessing that not having these for dane/ruby may be why it cannot find mpirun. E3SM-Unified on a login node has
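A generic way to see which launcher, if any, ends up on the PATH after activating E3SM-Unified on a login node (not specific to ruby/dane):

```bash
# After activating E3SM-Unified, check whether MPI launchers resolve and
# where they come from (conda environment, system module, or not at all).
which mpirun || echo "mpirun not on PATH"
which srun   || echo "srun not on PATH"
```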
@jasonb5 thank you for troubleshooting further on dane/ruby. Regarding adding machine-specific adjustments to NCO, I'm tagging @czender to advise.
Absolutely. If you want to make a PR to https://github.com/E3SM-Project/spack to also make it skip some of the madness for dane/ruby or just to clean up the madness more broadly, that would be welcome.
@jasonb5, regarding ESMF, could you provide me with a small reproducer for that error? It could be a general problem and that environment variable may not be getting set properly in Spack.
@jasonb5, @xylar, and @chengzhuzhang: Jason's understanding is correct. To elaborate further, NCO defaults to using mpirun.
@czender Thank you, this is good to know. The current machines we're supporting are Slurm-based. I've manually added an override to test, and it's now moving past the original issue. Is there a way to change the invocation to use srun rather than mpirun? Or is this something that would need to be added to the internal database?
You should be able to change from |
The most recent NCO snapshot now defaults to managing nodes with |
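For readers following along, the launcher change under discussion amounts to something like the following on a Slurm machine (the executable name is a placeholder; this is not NCO's literal internal command line):

```bash
# Generic illustration of swapping launchers under Slurm; not NCO's actual
# internal invocation.
mpirun -n 8 ./my_mpi_program   # generic MPI launcher; needs mpirun on PATH
srun -n 8 ./my_mpi_program     # Slurm-native launcher
```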
@czender This sounds like it should work for us, I'll grab the latest branch and test it out on |
Adds config/spack files for LLNL's ruby/dane machines.