Intel Granite Rapids (Xeon 6980P 128-Core / 256-Thread) - HYDU_create_process (Too many open files) #45
Comments
This sounds a lot like an MPI hostname issue — for your
And when you run the benchmark, did you just run |
Worst case, though, you can nuke the build folder ( Usually if it hits Oh one more thing, just for context — is it running on Ubuntu or some other distro? |
Yes, I am running Hosts file is set up for local [127.0.0.1] as well. I've tried the Ps + Qs configuration at the default [1/4], and also [1/256] and [2/128], with no effect. I nuked the /opt/top500 folder and ran again. Same result. |
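For reference when trying different Ps and Qs values: HPL expects at least P × Q MPI processes, so the grid has to match the process count the launcher actually starts. A minimal sketch that lists the possible grid shapes for a given process count; the function name `grid_shapes` is hypothetical and not part of the playbook:

```python
def grid_shapes(nprocs):
    """Return every (P, Q) pair with P <= Q and P * Q == nprocs."""
    pairs = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            pairs.append((p, nprocs // p))
        p += 1
    return pairs


if __name__ == "__main__":
    # 256 is the thread count from the issue title; swap in your own count.
    for p, q in grid_shapes(256):
        print(f"Ps={p:<4} Qs={q}")
```

Of the values tried above, [1/256] and [2/128] both describe a 256-process grid, while [1/4] describes a 4-process grid.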
@CraftComputing - Can you try running the benchmark manually?
I wonder if there's some output Ansible's eating up that may be helpful debugging this. Also for completeness, can you post the contents of the
|
hosts.ini:
mpirun -f cluster-hosts ./xhpl:
cat /opt/top500/tmp/hpl-2.3/bin/top500/cluster-hosts
cat /opt/top500/tmp/hpl-2.3/bin/top500/HPL.dat:
|
@CraftComputing - Thanks! I'm going to boot my Ampere machine and double check a couple things. I think it may be what's in the Can you check the contents of your |
For comparison, my files:
And I can confirm I can ping my system's mDNS name OR local IP and get a result:
Can you confirm the same on your system? I wonder if you might have a network setup that is causing mpich to be angry :( |
Hosts file... 10.0.0.179 is my local IP address
I can ping both the local IP |
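As a complement to ping, here is a rough sketch that checks what the system resolver returns for the machine's own hostname, since that is roughly what an MPI launcher relies on. The helper name `resolve` is hypothetical, and this only inspects name resolution; it does not reproduce what mpich itself does:

```python
import socket


def resolve(name):
    """Return the sorted list of addresses a name resolves to, or None on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return None
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for name in (socket.gethostname(), "localhost"):
        print(name, "->", resolve(name))
```

If the hostname resolves to nothing, or to an address that is not the one listed in the hosts file, that mismatch is worth ruling out first.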
Another idea, since you have Hyperthreading... can you modify |
Changed 512 to 256, and still hanging at the same spot.
|
It sounds like my script needs a little updating, specifically the
I'll look at a better option for that. (Or maybe have it switch depending on architecture?) Separately, since you mentioned that switching the count from On the AmpereOne, tweaking that made almost a 40% improvement, but its architecture is vastly different than the generic |
Opened a follow-up issue: #46 For now, we can just act like Ansible doesn't exist anymore and run the command manually :) |
Attempting to run Top500 Playbook on a 2P Granite Rapids server. Single server test.
Benchmark errors out with "HYDU_create_process (lib/utils/launch.c:24): pipe error (Too many open files)"
I have attempted to work around the issue by raising the open-file limit on the host (ulimit -n 4096). Sometimes that still results in the same error... other times, the script hangs at "TASK [Run the benchmark]" and never progresses.
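To confirm which open-file limit a process actually sees, a small sketch like the following can be run from the same shell or session that launches the benchmark. Note that `ulimit -n` only affects the current shell and its children, so a limit raised in one terminal may not apply to processes started elsewhere. The function names here are hypothetical:

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_raise_soft_limit(target):
    """True if an unprivileged process could raise its soft limit to target."""
    _, hard = nofile_limits()
    return hard == resource.RLIM_INFINITY or target <= hard


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    # 4096 is the value tried in this issue.
    print("4096 within hard limit:", can_raise_soft_limit(4096))
```

If the soft limit printed here is lower than expected, the raised limit is not reaching the process that launches mpirun.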