Export LD_LIBRARY_PATH in the job_script_prologue by default #40

Open
ikrommyd opened this issue Dec 18, 2024 · 6 comments · May be fixed by #42

Comments

@ikrommyd

ikrommyd commented Dec 18, 2024

If a package that requires a newer libstdc++ is installed via pip in the coffea images, then ONLY in the workers, library path resolution points to /lib64/libstdc++.so.6 instead of the newer one that is installed via conda under /usr/local/lib.
Therefore you will get this in the workers:

ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/local/lib/python3.11/site-packages/pyarrow/lib.cpython-311-x86_64-linux-gnu.so)

because, for some reason, the workers resolve the library path differently.
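One way to confirm which library is too old is to list the GLIBCXX version tags that each libstdc++.so.6 carries, much like running `strings` on the file and grepping for GLIBCXX. A minimal sketch, where the helper name `glibcxx_versions` is made up for illustration and the two candidate paths are the ones mentioned above:

```python
import re
from pathlib import Path


def glibcxx_versions(lib_path):
    """Return the GLIBCXX_* version tags embedded in a shared library.

    Hypothetical helper: it scans the raw bytes of the file, roughly
    what `strings <lib> | grep GLIBCXX` would show.
    """
    data = Path(lib_path).read_bytes()
    tags = {m.decode() for m in re.findall(rb"GLIBCXX_\d+(?:\.\d+)*", data)}

    def numeric_key(tag):
        # Sort numerically so that 3.4.9 comes before 3.4.29.
        return tuple(int(part) for part in tag.split("_", 1)[1].split("."))

    return sorted(tags, key=numeric_key)


if __name__ == "__main__":
    for candidate in ("/lib64/libstdc++.so.6", "/usr/local/lib/libstdc++.so.6"):
        if Path(candidate).exists():
            versions = glibcxx_versions(candidate)
            print(candidate, versions[-1] if versions else "no GLIBCXX tags found")
```

If GLIBCXX_3.4.29 is missing from the list for the library the worker actually loads, that would match the ImportError above.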

Therefore the user would need to add job_script_prologue=["export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH"] in the cluster setup.
It may be a good idea to add it ourselves by default. One option is to add it under:

job-script-prologue: []
The other option is to add it in the code, so that the export is always added to the job_script_prologue list no matter what else the user specifies there.

@ikrommyd
Author

I actually think the second one is best, because if the user specifies a different job_script_prologue in code, the default one gets overwritten, and without the export it will not work.
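To make the second option concrete, here is a rough sketch of how the export could be merged with whatever the user passes. The helper `with_library_path` is hypothetical and not part of lpcjobqueue; it only illustrates the merging logic:

```python
LD_EXPORT = "export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH"


def with_library_path(job_script_prologue=None):
    """Return a prologue list that always contains the LD_LIBRARY_PATH export.

    Hypothetical helper, not part of lpcjobqueue: it merges the default
    export with whatever prologue lines the user supplies.
    """
    prologue = list(job_script_prologue or [])
    if LD_EXPORT not in prologue:
        # Put the export first so later user commands see the updated path.
        prologue.insert(0, LD_EXPORT)
    return prologue


print(with_library_path(["source setup.sh"]))
```

With this approach a user-supplied job_script_prologue extends the default instead of replacing it, and passing the export explicitly does not duplicate it.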

@nsmith-
Member

nsmith- commented Dec 18, 2024

I'm rather reluctant to modify the library path globally. Do we have any indication what the difference is between worker and client environment?

@nsmith-
Member

nsmith- commented Dec 18, 2024

It seems that the client and worker disagree about which image they are using, because the ldd output from

def baz():
    import subprocess

    path = "/usr/local/lib/python3.10/site-packages/awkward_cpp/lib/libawkward.so"
    return subprocess.getoutput(f"ldd -v {path}")

print(baz())
print("="*30)
for k, v in client.run(baz).items():
    print(k)
    print(v)

includes

        Version information:
        /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/libawkward.so:
                libgcc_s.so.1 (GCC_3.0) => /lib64/libgcc_s.so.1
                libpthread.so.0 (GLIBC_2.2.5) => /lib64/libpthread.so.0
                libm.so.6 (GLIBC_2.2.5) => /lib64/libm.so.6
                libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
                libstdc++.so.6 (GLIBCXX_3.4.19) => /lib64/libstdc++.so.6
                libstdc++.so.6 (CXXABI_1.3) => /lib64/libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.11) => /lib64/libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.9) => /lib64/libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.15) => /lib64/libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4) => /lib64/libstdc++.so.6

on the client side and

        /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/libawkward.so:
                libgcc_s.so.1 (GCC_3.0) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libgcc_s.so.1
                libm.so.6 (GLIBC_2.2.5) => /lib64/libm.so.6
                libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
                libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
                libstdc++.so.6 (GLIBCXX_3.4.15) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.20) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (CXXABI_1.3.8) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.11) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (CXXABI_1.3.9) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.29) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.26) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.9) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (CXXABI_1.3) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.21) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4.19) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6
                libstdc++.so.6 (GLIBCXX_3.4) => /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6

on the worker side.
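Comparing the two listings by eye is tedious, so the `ldd -v` text can be reduced to the set of files a given library resolves to. A small sketch, assuming the output has the `name (VERSION) => path` shape shown above; `resolved_targets` is an illustrative name:

```python
import os
import re


def resolved_targets(ldd_output, library="libstdc++.so.6"):
    """Collect the distinct files that `library` resolves to in `ldd -v` output."""
    targets = set()
    for line in ldd_output.splitlines():
        # Lines of interest look like:
        #   libstdc++.so.6 (GLIBCXX_3.4.19) => /lib64/libstdc++.so.6
        match = re.match(r"\s*(\S+)\s+\(.*?\)\s+=>\s+(\S+)", line)
        if match and match.group(1) == library:
            # normpath only collapses "../" lexically; it does not follow symlinks.
            targets.add(os.path.normpath(match.group(2)))
    return targets


client_side = "    libstdc++.so.6 (GLIBCXX_3.4.19) => /lib64/libstdc++.so.6"
worker_side = (
    "    libstdc++.so.6 (GLIBCXX_3.4.29) => "
    "/usr/local/lib/python3.10/site-packages/awkward_cpp/lib/../../../../libstdc++.so.6"
)
print(resolved_targets(client_side))
print(resolved_targets(worker_side))
```

For the sample lines taken from the output above, the client side collapses to /lib64/libstdc++.so.6 and the worker side to /usr/local/lib/libstdc++.so.6.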

It turns out that the symlink /cvmfs/unpacked.cern.ch/registry.hub.docker.com/coffeateam/coffea-dask-almalinux8:latest is not resolving to the same target on worker and client. For example, the md5sum of libawkward.so is different! A quick fix is to determine the real path of the symlink on the client side and use that as the +SingularityImage argument to HTCondor.
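The quick fix could look roughly like the sketch below. The `resolve_image` helper and the `job_attributes` dict are illustrative only, and the double-quoted string value is an assumption about how the attribute ends up in the submit description, not a statement about the actual lpcjobqueue code:

```python
import os


def resolve_image(image):
    """Resolve symlinks on the client so every worker gets one concrete path."""
    return os.path.realpath(image)


image = "/cvmfs/unpacked.cern.ch/registry.hub.docker.com/coffeateam/coffea-dask-almalinux8:latest"

# Assumption: job attributes are passed as a plain dict of HTCondor submit
# entries, with string values wrapped in double quotes.
job_attributes = {"+SingularityImage": f'"{resolve_image(image)}"'}
print(job_attributes)
```

Note that `os.path.realpath` returns the normalized input path when the file does not exist, so on a machine without /cvmfs this just echoes the original path.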

@nsmith-
Member

nsmith- commented Dec 18, 2024

.. or not. Even though

return subprocess.getoutput(f"md5sum {path}")

returns different values:

34646393a2178e83b1f1976e33853275  /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/libawkward.so
==============================
tcp://131.225.189.27:10000
a820955f6dbfa322c546f7639ab553f6  /usr/local/lib/python3.10/site-packages/awkward_cpp/lib/libawkward.so

the realpath of the image:

return subprocess.getoutput(f"realpath /cvmfs/unpacked.cern.ch/registry.hub.docker.com/coffeateam/coffea-dask-almalinux8:latest")

returns the same:

/cvmfs/unpacked.cern.ch/.flat/a8/a89bdf03a4f6019fdbc2b9fc8f0c9e46ffaab1832c57df2245f5b4bfa588a251
==============================
tcp://131.225.188.157:10000
/cvmfs/unpacked.cern.ch/.flat/a8/a89bdf03a4f6019fdbc2b9fc8f0c9e46ffaab1832c57df2245f5b4bfa588a251
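For repeated checks like these, the client-versus-worker comparison can be automated. This is a sketch under the assumption that the worker results arrive as a dict keyed by worker address, which is the shape `client.run` returned in the snippets above; both helper names are made up:

```python
import hashlib


def md5_of(path):
    """Return the hex MD5 digest of a file, like the first column of md5sum."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large libraries are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_workers(local_value, worker_values):
    """Return the worker addresses whose value differs from the local one."""
    return sorted(
        addr for addr, value in worker_values.items() if value != local_value
    )


# Values copied from the md5sum output above.
local = "34646393a2178e83b1f1976e33853275"
workers = {"tcp://131.225.189.27:10000": "a820955f6dbfa322c546f7639ab553f6"}
print(mismatched_workers(local, workers))
```

With a connected dask client, something like `mismatched_workers(md5_of(path), client.run(md5_of, path))` should list the workers that see a different file.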

nsmith- added a commit that referenced this issue Dec 18, 2024
Also resolve symlink in image path
Closes #32 and #40
@nsmith-
Member

nsmith- commented Dec 18, 2024

Ah, never mind, I think it was a stale file handle on the client. Restarting the ./shell made everything consistent and I can no longer reproduce the error. I will add the realpath to the image anyway, to guard against client-worker mismatches in the future.

@nsmith- nsmith- linked a pull request Dec 18, 2024 that will close this issue
@ikrommyd
Author

ikrommyd commented Dec 19, 2024

Thanks for the checks, Nick. Yeah, I also saw this difference in the ldd output (I believe I sent it on Slack, but maybe not). I didn't investigate further though. I'll test #42 with pip-installed awkward in the images. How do I test a PR of lpcjobqueue though? Do I just install it from this branch in the image after starting the ./shell and then just ship it?
