
[UPDATE]: update to oneapi toolkit 2024 and torch version 2.1.0 #239

Merged
merged 24 commits from oneapi2024_w_torch2.1 into llvm-target on Jan 17, 2024

Conversation

quintinwang5
Contributor

@quintinwang5 quintinwang5 commented Jan 11, 2024

Update to oneAPI toolkit 2024 and update to torch 2.1.0.
They must be updated at the same time because the IPEX 1.13 package dynamically links against libraries from oneAPI 2023.
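
As an aside, the dynamic linkage can be checked locally with a small sketch like the one below, assuming intel_extension_for_pytorch is installed and ldd is available (the package layout scanned here is an assumption):

```python
# Sketch: list the MKL/SYCL libraries that the installed IPEX shared objects link
# against, to see which oneAPI release the wheel expects. Not part of this PR.
import subprocess
from pathlib import Path

import intel_extension_for_pytorch as ipex  # assumed to be installed

pkg_dir = Path(ipex.__file__).parent
for so in sorted(pkg_dir.rglob("*.so")):
    out = subprocess.run(["ldd", str(so)], capture_output=True, text=True).stdout
    deps = [line.strip() for line in out.splitlines() if "mkl" in line or "sycl" in line]
    if deps:
        print(so.name)
        for dep in deps:
            print("   ", dep)
```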

@pbchekin
Contributor

@quintinwang5 did it work locally?

@quintinwang5
Contributor Author

@quintinwang5 did it work locally?

Yes. I have oneapi 2024.0 locally. It works well.

@quintinwang5
Contributor Author

quintinwang5 commented Jan 12, 2024

@pbchekin It seems oneAPI 2023 is still used, and something is wrong with the oneAPI environment variables. Should I change something on the CI machine?

PATH: /home/waihungt/.local/bin:/home/whitney/bin:/home/waihungt/.vscode-server/bin/1a5daa3a0231a0fbba4f14db7ec463cf99d7768e/bin/remote-cli:/home/waihungt/.local/bin:/home/waihungt/bin:/home/whitney/bin:/usr/DPA/tools/oneAPI/2023.2.0/vtune/2023.2.0/bin64:/usr/DPA/tools/oneAPI/2023.2.0/mpi/2021.10.0//libfabric/bin:/usr/DPA/tools/oneAPI/2023.2.0/mpi/2021.10.0//bin:/usr/DPA/tools/oneAPI/2023.2.0/mkl/2023.2.0/bin/intel64:/usr/DPA/tools/oneAPI/2023.2.0/itac/2021.10.0/bin:/usr/DPA/tools/oneAPI/2023.2.0/intelpython/latest/bin:/usr/DPA/tools/oneAPI/2023.2.0/intelpython/latest/condabin:/usr/DPA/tools/oneAPI/2023.2.0/inspector/2023.2.0/bin64:/usr/DPA/tools/oneAPI/2023.2.0/dpcpp-ct/2023.2.0/bin:/usr/DPA/tools/oneAPI/2023.2.0/dev-utilities/2021.10.0/bin:/usr/DPA/tools/oneAPI/2023.2.0/debugger/2023.2.0/gdb/intel64/bin:/usr/DPA/tools/oneAPI/2023.2.0/compiler/2023.2.0/linux/lib/oclfpga/bin:/usr/DPA/tools/oneAPI/2023.2.0/compiler/2023.2.0/linux/bin/intel64:/usr/DPA/tools/oneAPI/2023.2.0/compiler/2023.2.0/linux/bin:/usr/DPA/tools/oneAPI/2023.2.0/advisor/2023.2.0/bin64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
E   ImportError: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or directory

libmkl_sycl_blas.so.4 should come from oneAPI 2024.
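
As a quick sanity check, a sketch like the following can confirm whether the loader resolves that library in the current environment, i.e. whether the oneAPI 2024 environment has actually been sourced; the LD_LIBRARY_PATH scan is only illustrative:

```python
# Sketch: check whether libmkl_sycl_blas.so.4 (shipped with oneAPI 2024) is
# resolvable, which indicates whether the oneAPI 2024 environment is active.
import ctypes
import os

LIB = "libmkl_sycl_blas.so.4"
try:
    ctypes.CDLL(LIB)
    print(f"{LIB} resolved successfully")
except OSError:
    hits = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
            if d and os.path.exists(os.path.join(d, LIB))]
    print(f"{LIB} found in: {hits}" if hits else f"{LIB} not found on LD_LIBRARY_PATH")
```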

@pbchekin
Contributor

@pbchekin It seems oneAPI 2023 is still used, and something is wrong with the oneAPI environment variables. Should I change something on the CI machine?

We temporarily have two CIs working in parallel. You can try switching oneAPI in one of them by replacing pvc with oneapi-2024.0.1 in https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/.github/workflows/build_and_test_2.yaml#L75.

@pbchekin
Contributor

pbchekin commented Jan 12, 2024

If that works, you need to wait for #241, which disables the old workflow.

@whitneywhtsang
Contributor

If that works, you need to wait for #241, which disables the old workflow.

Or I can change the Triton DSE Pre Commit runner to use oneAPI 2024 manually, when it is proven to work on the Build and test CI.

@quintinwang5
Contributor Author

If that works, you need to wait for #241, which disables the old workflow.

Or I can change the Triton DSE Pre Commit runner to use oneAPI 2024 manually, when it is proven to work on the Build and test CI.

@whitneywhtsang Can I know the level-zero runtime version on that machine?
A large number of UTs fail with:

RuntimeError: Triton Error [ZE]: 2013265923
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE = 0x78000003
[Validation] generic error code for unsupported features

These UTs pass locally with oneAPI 2024, so I think it's more likely a runtime environment problem.
I'm trying to downgrade (maybe?) the level-zero package to the same version as the CI machine to verify this.
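
As an aside, the decimal code in the traceback is simply the Level Zero status enum printed in base 10, which a quick conversion confirms:

```python
# The launcher reports ZE result codes in decimal; converting to hex confirms the enum value.
code = 2013265923
print(hex(code))  # 0x78000003 == ZE_RESULT_ERROR_UNSUPPORTED_FEATURE
assert code == 0x78000003
```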

@pbchekin
Contributor

Can I know the level-zero runtime version on that machine?

ii  intel-level-zero-gpu         1.3.26241.33-647~22.04
ii  level-zero                   1.11.0-647~22.04
ii  level-zero-dev               1.11.0-647~22.04

@whitneywhtsang
Contributor

whitneywhtsang commented Jan 12, 2024

Can I know the level-zero runtime version on that machine?

On the Triton DSE Pre Commit runner:

ii  intel-level-zero-gpu                       1.3.26690.29-704~22.04
ii  level-zero                                 1.12.0-693~22.04
ii  level-zero-dev                             1.12.0-693~22.04

@pbchekin
Contributor

This is how we install level_zero for the runner:
https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/.github/dockerfiles/runner-base/Dockerfile#L18-L25

How did you install it on a local machine?

@quintinwang5
Contributor Author

quintinwang5 commented Jan 12, 2024

1.3.26241.33-647~22.04

intel-level-zero-gpu 1.3.26241.33-647~22.04 also leads to RuntimeError: Triton Error [ZE]: 2013265923 locally. But it seems I did not manage to update intel-level-zero-gpu in the Docker image.

RuntimeError: Triton Error [ZE]: 2013265923 returned by zeCommandListHostSynchronize(queue, std::numeric_limits<uint64_t>::max())

It's weird: if we ignore this error code (by not calling PyErr_SetString), the program still executes properly.
Found this issue: intel/llvm#12344, which is very similar to this problem.

@@ -131,7 +131,7 @@ def format_of(ty):
     char err[1024] = {{0}};
     strcat(err, prefix);
     strcat(err, str.c_str());
-    PyErr_SetString(PyExc_RuntimeError, err);
+    //PyErr_SetString(PyExc_RuntimeError, err);
Contributor Author

@quintinwang5 quintinwang5 Jan 12, 2024

@pbchekin This is just for testing; I will remove it later. Can you help check whether we have the specific intel-level-zero-gpu version? It seems installing it via the Dockerfile does not work. Thanks!

Contributor

@pbchekin let's switch to the Agama 775.20 release, which should work with oneAPI 2024.0.1/2:

https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html

Contributor

Agama 775.20 has been installed: the kernel driver on the hosts, and Level Zero on the runners labeled oneapi-2024.0.1, which are currently selected only for this PR. We will keep two sets of runners (oneAPI 2023.2.0 and oneAPI 2024.0.1) until this PR is merged.

@pbchekin
Contributor

Level zero has been updated to the latest stable rolling release in the runners:

ii  intel-level-zero-gpu         1.3.27191.42-775~22.04
ii  level-zero                   1.14.0-744~22.04
ii  level-zero-dev               1.14.0-744~22.04

@whitneywhtsang
Contributor

Level Zero cannot be updated to Agama 775.20 on pvc-b4-spr, due to concern about breaking other users' workloads, including PyTorch.

@quintinwang5
Contributor Author

@pbchekin These errors are something like:

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

This should be a bug in intel-igc-core and intel-igc-opencl.
The versions verified to have this bug are 1.0.15136.22 and 1.0.15136.4. It can be reproduced just by compiling the SPIR-V kernel produced by the UT with ocloc.
All UTs pass with 1.0.14828.8 locally.
Could you please check?
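
A rough sketch of that local reproducer is below; the SPIR-V file name, device name, and exact ocloc flags are assumptions:

```python
# Sketch: compile the SPIR-V module dumped by the failing UT with ocloc and look
# for the undefined __builtin_spirv_OpenCL_* symbol in the compiler output.
import subprocess

result = subprocess.run(
    ["ocloc", "compile", "-file", "kernel_0d1d.spv", "-spirv_input", "-device", "pvc"],
    capture_output=True,
    text=True,
)
output = result.stdout + result.stderr
if "__builtin_spirv_OpenCL_sin_f64" in output:
    print("reproduced: IGC cannot resolve the OpenCL builtin")
else:
    print("kernel built successfully; this IGC version looks fine")
```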

@pbchekin
Contributor

Could you please check?

Sorry, I don't understand what to check.

There are no intel-igc-* deb packages installed on the runner. oneAPI 2024.0.1 is installed in ~/intel/oneapi by the offline installer; we do not control the versions bundled with the installer.

@quintinwang5
Contributor Author

Could you please check?

Sorry, I don't understand what to check.

There are no intel-igc-* deb packages installed on the runner. oneAPI 2024.0.1 is installed in ~/intel/oneapi by the offline installer; we do not control the versions bundled with the installer.

Can we install another version via the Dockerfile? I tried this, but it seems it did not work.

@pbchekin
Contributor

Can we install another version via the Dockerfile? I tried this, but it seems it did not work.

This Dockerfile is used to build the runner image; it is not used during CI directly. Instead of modifying the Dockerfile, you can try installing packages in the CI workflow with sudo apt install.

@quintinwang5
Contributor Author

Can we install another version via the Dockerfile? I tried this, but it seems it did not work.

This Dockerfile is used to build the runner image; it is not used during CI directly. Instead of modifying the Dockerfile, you can try installing packages in the CI workflow with sudo apt install.

Thanks. I'll give it a try.

@quintinwang5
Contributor Author

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here?
And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

@etiotto
Contributor

etiotto commented Jan 16, 2024

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

It was probably done for consistency with what the GPU dialect already does for NVVM and ROCDL. The GENX dialect is a counterpart to the NVVM/ROCDL dialects. The GPU dialect has corresponding conversions here for NVVM and here for ROCDL.

Having said that, it is questionable that the GPU dialect lowers operations from another dialect (the math dialect in this instance). But that is already the case, so I think it is OK for us to follow suit at this point.

The latest versions of IGC support those OpenCL functions. I think we need to focus on the latest IGC version. @pengtu what is your opinion?

@quintinwang5
Contributor Author

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

It was probably done for consistency with what the GPU dialect already does for NVVM and ROCDL. The GENX dialect is a counterpart to the NVVM/ROCDL dialects. The GPU dialect has corresponding conversions here for NVVM and here for ROCDL.

Having said that, it is questionable that the GPU dialect lowers operations from another dialect (the math dialect in this instance). But that is already the case, so I think it is OK for us to follow suit at this point.

The latest versions of IGC support those OpenCL functions. I think we need to focus on the latest IGC version. @pengtu what is your opinion?

I noticed the counterpart of NVVM/ROCDL, but I want to know the rule for choosing these operators; it seems we have a big gap compared to them.
And I can reproduce this error with the latest public IGC releases (1.0.15136.22, 1.0.15136.4) locally. When I remove these lines, the UTs work well.

@quintinwang5 quintinwang5 force-pushed the oneapi2024_w_torch2.1 branch from 9cb8087 to f6dbe0c on January 17, 2024 03:45
@quintinwang5 quintinwang5 requested a review from pbchekin January 17, 2024 10:32
@quintinwang5
Contributor Author

@pbchekin Confirmed that the undefined reference to __builtin_spirv_OpenCL error is caused by libigc1 in CI. The CI environment uses libigc.so from libigc1, but on a local machine the libigc.so from intel-igc-core is used by default.
The released igc-1.0.15136.* packages (igc-1.0.15136.4 and igc-1.0.15136.22) should use __builtin_spirv_OpenCL_* instead of __spirv_ocl_* according to this config, but something went wrong.
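
As an aside, a small diagnostic sketch (not part of this PR) that shows which libigc.so the dynamic linker would pick on a given machine, i.e. whether it comes from the distro libigc1 package or from intel-igc-core:

```python
# Sketch: list the libigc entries in the dynamic linker cache. On the CI runner this
# would point at libigc1's library, while intel-igc-core provides it locally.
import subprocess

cache = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
for line in cache.splitlines():
    if "libigc" in line:
        print(line.strip())
```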

@quintinwang5
Contributor Author

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

@etiotto Created a JIRA here. We may need to change the symbols to the Khronos format once they switch to the khr translator completely.

@etiotto
Contributor

etiotto commented Jan 17, 2024

@pbchekin Confirmed that the undefined reference to __builtin_spirv_OpenCL error is caused by libigc1 in CI. The CI environment uses libigc.so from libigc1, but on a local machine the libigc.so from intel-igc-core is used by default. The released igc-1.0.15136.* packages (igc-1.0.15136.4 and igc-1.0.15136.22) should use __builtin_spirv_OpenCL_* instead of __spirv_ocl_* according to this config, but something went wrong.

So we are using the correct symbol.

@etiotto
Contributor

etiotto commented Jan 17, 2024

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

@etiotto Created a JIRA here. We may need to change the symbols to the Khronos format once they switch to the khr translator completely.

OK thanks for filing a report against IGC.

@etiotto
Contributor

etiotto commented Jan 17, 2024

@pbchekin this PR looks in good shape now, and it passed CI. Do you have any further comments and/or suggestions?

Once this PR is merged in, PR #180 can be rebased on top of it.

@pbchekin
Contributor

@pbchekin this PR looks in good shape now, and it passed CI. Do you have any further comments and/or suggestions?

Once this PR is merged in, PR #180 can be rebased on top of it.

Looks good to me! After we merge this we need to communicate to the team that their environments need to be updated.

@etiotto etiotto merged commit 47e7035 into llvm-target Jan 17, 2024
3 checks passed
@etiotto etiotto deleted the oneapi2024_w_torch2.1 branch January 17, 2024 17:40
pbchekin added a commit that referenced this pull request Jan 17, 2024
GitHub runner image based on Ubuntu 22.04 with oneAPI 2024.0.1 and
downgraded libigc1. The downgrade is required for #239.
Successfully merging this pull request may close these issues.

Use oneAPI 2024.0 rather than 2023.2 to align with PyTorch.