
[UPDATE]: update to oneapi toolkit 2024 and torch version 2.1.0 #239

Merged
merged 24 commits from oneapi2024_w_torch2.1 into llvm-target on Jan 17, 2024

Conversation

quintinwang5
Contributor

@quintinwang5 quintinwang5 commented Jan 11, 2024

Update to oneAPI toolkit 2024 and update to torch 2.1.0.
They must be updated at the same time because the IPEX 1.13 package dynamically links against libraries from oneAPI 2023.
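
As an aside, the dynamic linkage can be checked locally with a small sketch like the one below, assuming intel_extension_for_pytorch is installed and ldd is available (the package layout scanned here is an assumption):

```python
# Sketch: list the MKL/SYCL libraries that the installed IPEX shared objects link
# against, to see which oneAPI release the wheel expects. Not part of this PR.
import subprocess
from pathlib import Path

import intel_extension_for_pytorch as ipex  # assumed to be installed

pkg_dir = Path(ipex.__file__).parent
for so in sorted(pkg_dir.rglob("*.so")):
    out = subprocess.run(["ldd", str(so)], capture_output=True, text=True).stdout
    deps = [line.strip() for line in out.splitlines() if "mkl" in line or "sycl" in line]
    if deps:
        print(so.name)
        for dep in deps:
            print("   ", dep)
```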

@pbchekin
Contributor

@quintinwang5 did it work locally?

@quintinwang5
Contributor Author

@quintinwang5 did it work locally?

Yes. I have oneapi 2024.0 locally. It works well.

@quintinwang5
Contributor Author

quintinwang5 commented Jan 12, 2024

@pbchekin It seems oneAPI 2023 is still used, and something is wrong with the oneAPI environment variables. Should I change something on the CI machine?

PATH: /home/waihungt/.local/bin:/home/whitney/bin:/home/waihungt/.vscode-server/bin/1a5daa3a0231a0fbba4f14db7ec463cf99d7768e/bin/remote-cli:/home/waihungt/.local/bin:/home/waihungt/bin:/home/whitney/bin:/usr/DPA/tools/oneAPI/2023.2.0/vtune/2023.2.0/bin64:/usr/DPA/tools/oneAPI/2023.2.0/mpi/2021.10.0//libfabric/bin:/usr/DPA/tools/oneAPI/2023.2.0/mpi/2021.10.0//bin:/usr/DPA/tools/oneAPI/2023.2.0/mkl/2023.2.0/bin/intel64:/usr/DPA/tools/oneAPI/2023.2.0/itac/2021.10.0/bin:/usr/DPA/tools/oneAPI/2023.2.0/intelpython/latest/bin:/usr/DPA/tools/oneAPI/2023.2.0/intelpython/latest/condabin:/usr/DPA/tools/oneAPI/2023.2.0/inspector/2023.2.0/bin64:/usr/DPA/tools/oneAPI/2023.2.0/dpcpp-ct/2023.2.0/bin:/usr/DPA/tools/oneAPI/2023.2.0/dev-utilities/2021.10.0/bin:/usr/DPA/tools/oneAPI/2023.2.0/debugger/2023.2.0/gdb/intel64/bin:/usr/DPA/tools/oneAPI/2023.2.0/compiler/2023.2.0/linux/lib/oclfpga/bin:/usr/DPA/tools/oneAPI/2023.2.0/compiler/2023.2.0/linux/bin/intel64:/usr/DPA/tools/oneAPI/2023.2.0/compiler/2023.2.0/linux/bin:/usr/DPA/tools/oneAPI/2023.2.0/advisor/2023.2.0/bin64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
E   ImportError: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or directory

libmkl_sycl_blas.so.4 should come from oneAPI 2024.
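
As a quick sanity check, a sketch like the following can confirm whether the loader resolves that library in the current environment, i.e. whether the oneAPI 2024 environment has actually been sourced; the LD_LIBRARY_PATH scan is only illustrative:

```python
# Sketch: check whether libmkl_sycl_blas.so.4 (shipped with oneAPI 2024) is
# resolvable, which indicates whether the oneAPI 2024 environment is active.
import ctypes
import os

LIB = "libmkl_sycl_blas.so.4"
try:
    ctypes.CDLL(LIB)
    print(f"{LIB} resolved successfully")
except OSError:
    hits = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
            if d and os.path.exists(os.path.join(d, LIB))]
    print(f"{LIB} found in: {hits}" if hits else f"{LIB} not found on LD_LIBRARY_PATH")
```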

@pbchekin
Contributor

@pbchekin It seems oneAPI 2023 is still used, and something is wrong with the oneAPI environment variables. Should I change something on the CI machine?

We temporarily have two CIs working in parallel. You can try switching oneAPI in one of them by replacing pvc with oneapi-2024.0.1 in https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/.github/workflows/build_and_test_2.yaml#L75.

@pbchekin
Contributor

pbchekin commented Jan 12, 2024

If that works, you need to wait for #241, which disables the old workflow.

@whitneywhtsang
Contributor

If that works, you need to wait for #241, which disables the old workflow.

Or I can change the Triton DSE Pre Commit runner to use oneAPI 2024 manually, when it is proven to work on the Build and test CI.

@quintinwang5
Contributor Author

If that works, you need to wait for #241, which disables the old workflow.

Or I can change the Triton DSE Pre Commit runner to use oneAPI 2024 manually, when it is proven to work on the Build and test CI.

@whitneywhtsang Can I know the level-zero runtime version on that machine?
A large number of UTs fail with:

RuntimeError: Triton Error [ZE]: 2013265923
ZE_RESULT_ERROR_UNSUPPORTED_FEATURE = 0x78000003
[Validation] generic error code for unsupported features

These UTs pass locally with oneAPI 2024, so I think it's more likely a runtime environment problem.
I'm trying to downgrade (maybe?) the level-zero package to the same version as the CI machine to verify this.
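
As an aside, the decimal code in the traceback is simply the Level Zero status enum printed in base 10, which a quick conversion confirms:

```python
# The launcher reports ZE result codes in decimal; converting to hex confirms the enum value.
code = 2013265923
print(hex(code))  # 0x78000003 == ZE_RESULT_ERROR_UNSUPPORTED_FEATURE
assert code == 0x78000003
```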

@pbchekin
Contributor

Can I know the level-zero runtime version on that machine?

ii  intel-level-zero-gpu         1.3.26241.33-647~22.04
ii  level-zero                   1.11.0-647~22.04
ii  level-zero-dev               1.11.0-647~22.04

@whitneywhtsang
Contributor

whitneywhtsang commented Jan 12, 2024

Can I know the level-zero runtime version on that machine?

On the Triton DSE Pre Commit runner:

ii  intel-level-zero-gpu                       1.3.26690.29-704~22.04
ii  level-zero                                 1.12.0-693~22.04
ii  level-zero-dev                             1.12.0-693~22.04

@pbchekin
Contributor

This is how we install level_zero for the runner:
https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/.github/dockerfiles/runner-base/Dockerfile#L18-L25

How did you install it on a local machine?

@quintinwang5
Contributor Author

quintinwang5 commented Jan 12, 2024

1.3.26241.33-647~22.04

intel-level-zero-gpu 1.3.26241.33-647~22.04 also leads to RuntimeError: Triton Error [ZE]: 2013265923 locally. But it seems I did not manage to update intel-level-zero-gpu in the Docker image.

RuntimeError: Triton Error [ZE]: 2013265923 returned by zeCommandListHostSynchronize(queue, std::numeric_limits<uint64_t>::max())

It's weird: if we ignore this error code (by not calling PyErr_SetString), the program still executes properly.
Found this issue: intel/llvm#12344, which is very similar to this problem.

@@ -131,7 +131,7 @@ def format_of(ty):
     char err[1024] = {{0}};
     strcat(err, prefix);
     strcat(err, str.c_str());
-    PyErr_SetString(PyExc_RuntimeError, err);
+    //PyErr_SetString(PyExc_RuntimeError, err);
Contributor Author

@quintinwang5 quintinwang5 Jan 12, 2024

@pbchekin This is just for testing; I will remove it later. Can you help check whether we have the specific intel-level-zero-gpu version? It seems installing it via the Dockerfile does not work. Thanks!

Contributor

@pbchekin let's switch to the Agama 775.20 release, which should work with oneAPI 2024.0.1/2:

https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html

Contributor

Agama 775.20 has been installed: the kernel driver on the hosts, and Level Zero on the runners labeled oneapi-2024.0.1, which are currently selected only for this PR. We will keep two sets of runners (oneAPI 2023.2.0 and oneAPI 2024.0.1) until this PR is merged.

@pbchekin
Contributor

Level zero has been updated to the latest stable rolling release in the runners:

ii  intel-level-zero-gpu         1.3.27191.42-775~22.04
ii  level-zero                   1.14.0-744~22.04
ii  level-zero-dev               1.14.0-744~22.04

@whitneywhtsang
Contributor

Level Zero cannot be updated to Agama 775.20 on pvc-b4-spr, due to concern about breaking other users' workloads, including PyTorch.

@quintinwang5
Contributor Author

@pbchekin These errors are something like:

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

This should be a bug in intel-igc-core and intel-igc-opencl.
The versions verified to have this bug are 1.0.15136.22 and 1.0.15136.4. It can be reproduced just by compiling the SPIR-V kernel produced by the UT with ocloc.
All UTs pass with 1.0.14828.8 locally.
Could you please check?
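
A rough sketch of that local reproducer is below; the SPIR-V file name, device name, and exact ocloc flags are assumptions:

```python
# Sketch: compile the SPIR-V module dumped by the failing UT with ocloc and look
# for the undefined __builtin_spirv_OpenCL_* symbol in the compiler output.
import subprocess

result = subprocess.run(
    ["ocloc", "compile", "-file", "kernel_0d1d.spv", "-spirv_input", "-device", "pvc"],
    capture_output=True,
    text=True,
)
output = result.stdout + result.stderr
if "__builtin_spirv_OpenCL_sin_f64" in output:
    print("reproduced: IGC cannot resolve the OpenCL builtin")
else:
    print("kernel built successfully; this IGC version looks fine")
```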

@pbchekin
Contributor

Could you please check?

Sorry, I don't understand what to check.

There are no intel-igc-* deb packages installed on the runner. oneAPI 2024.0.1 is installed in ~/intel/oneapi by the offline installer; we do not control the versions bundled with the installer.

@quintinwang5
Contributor Author

Could you please check?

Sorry, I don't understand what to check.

There are no intel-igc-* deb packages installed on the runner. oneAPI 2024.0.1 is installed in ~/intel/oneapi by the offline installer; we do not control the versions bundled with the installer.

Can we install another version via the Dockerfile? I tried this, but it seems it did not work.

@pbchekin
Contributor

Can we install another version via the Dockerfile? I tried this, but it seems it did not work.

This Dockerfile is used to build the runner image; it is not used during CI directly. Instead of modifying the Dockerfile, you can try installing packages in the CI workflow with sudo apt install.

@quintinwang5
Contributor Author

Can we install another version via the Dockerfile? I tried this, but it seems it did not work.

This Dockerfile is used to build the runner image; it is not used during CI directly. Instead of modifying the Dockerfile, you can try installing packages in the CI workflow with sudo apt install.

Thanks. I'll give it a try.

@quintinwang5
Contributor Author

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here?
And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

@etiotto
Contributor

etiotto commented Jan 16, 2024

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

It was probably done for consistency with what the GPU dialect already does for NVVM and ROCDL. The GENX dialect is a counterpart to the NVVM/ROCDL dialects. The GPU dialect has corresponding conversions here for NVVM and here for ROCDL.

Having said that, it is questionable that the GPU dialect lowers operations from another dialect (the math dialect in this instance). But that is already the case, so I think it is OK for us to follow suit at this point.

The latest versions of IGC support those OpenCL functions. I think we need to focus on the latest IGC version. @pengtu what is your opinion?

@quintinwang5
Contributor Author

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

It was probably done for consistency with what the GPU dialect already does for NVVM and ROCDL. The GENX dialect is a counterpart to the NVVM/ROCDL dialects. The GPU dialect has corresponding conversions here for NVVM and here for ROCDL.

Having said that, it is questionable that the GPU dialect lowers operations from another dialect (the math dialect in this instance). But that is already the case, so I think it is OK for us to follow suit at this point.

The latest versions of IGC support those OpenCL functions. I think we need to focus on the latest IGC version. @pengtu what is your opinion?

I noticed the counterpart of NVVM/ROCDL, but I want to know the rule for choosing these operators; it seems we have a big gap compared to them.
And I can reproduce this error with the latest public IGC releases (1.0.15136.22, 1.0.15136.4) locally. When I remove these lines, the UTs work well.

@quintinwang5 quintinwang5 force-pushed the oneapi2024_w_torch2.1 branch from 9cb8087 to f6dbe0c on January 17, 2024 03:45
@quintinwang5 quintinwang5 requested a review from pbchekin January 17, 2024 10:32
@quintinwang5
Contributor Author

@pbchekin Confirmed that the undefined reference to __builtin_spirv_OpenCL error is caused by libigc1 in CI. The CI environment uses libigc.so from libigc1, but on a local machine the libigc.so from intel-igc-core is used by default.
The released igc-1.0.15136.* packages (igc-1.0.15136.4 and igc-1.0.15136.22) should use __builtin_spirv_OpenCL_* instead of __spirv_ocl_* according to this config, but something went wrong.
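
As an aside, a small diagnostic sketch (not part of this PR) that shows which libigc.so the dynamic linker would pick on a given machine, i.e. whether it comes from the distro libigc1 package or from intel-igc-core:

```python
# Sketch: list the libigc entries in the dynamic linker cache. On the CI runner this
# would point at libigc1's library, while intel-igc-core provides it locally.
import subprocess

cache = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
for line in cache.splitlines():
    if "libigc" in line:
        print(line.strip())
```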

@quintinwang5
Contributor Author

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

@etiotto Created a JIRA here. We may need to change the symbols to the Khronos format once they switch to the khr translator completely.

@etiotto
Contributor

etiotto commented Jan 17, 2024

@pbchekin Confirmed that the undefined reference to __builtin_spirv_OpenCL error is caused by libigc1 in CI. The CI environment uses libigc.so from libigc1, but on a local machine the libigc.so from intel-igc-core is used by default. The released igc-1.0.15136.* packages (igc-1.0.15136.4 and igc-1.0.15136.22) should use __builtin_spirv_OpenCL_* instead of __spirv_ocl_* according to this config, but something went wrong.

So we are using the correct symbol.

@etiotto
Contributor

etiotto commented Jan 17, 2024

@etiotto Can I know the reason why we lower these ops (sin, cos, exp, log) here, since they can also be lowered here? And the resulting __builtin_spirv_OpenCL_* function calls may cause build failures for some IGC versions.

error: undefined reference to `__builtin_spirv_OpenCL_sin_f64'
in function: '__builtin_spirv_OpenCL_sin_f64' called by kernel: 'kernel_0d1d'

@etiotto Created a JIRA here. We may need to change the symbols to the Khronos format once they switch to the khr translator completely.

OK thanks for filing a report against IGC.

@etiotto
Contributor

etiotto commented Jan 17, 2024

@pbchekin this PR looks in good shape now, and it passed CI. Do you have any further comments and/or suggestions?

Once this PR is merged in, PR #180 can be rebased on top of it.

@pbchekin
Contributor

@pbchekin this PR looks in good shape now, and it passed CI. Do you have any further comments and/or suggestions?

Once this PR is merged in, PR #180 can be rebased on top of it.

Looks good to me! After we merge this we need to communicate to the team that their environments need to be updated.

@etiotto etiotto merged commit 47e7035 into llvm-target Jan 17, 2024
3 checks passed
@etiotto etiotto deleted the oneapi2024_w_torch2.1 branch January 17, 2024 17:40
pbchekin added a commit that referenced this pull request Jan 17, 2024
GitHub runner image based on Ubuntu 22.04 with oneAPI 2024.0.1 and
downgraded libigc1. The downgrade is required for #239.
Successfully merging this pull request may close these issues.

Use oneAPI 2024.0 rather than 2023.2 to align with PyTorch.