test_benchdnn_modeC_conv_ci_cpu fails on AArch64 CI for c7g instance #2303

Open
renato-arantes opened this issue Dec 20, 2024 · 5 comments · Fixed by #2357
Labels
bug A confirmed library bug platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64

Comments

@renato-arantes
Contributor

Summary

The test test_benchdnn_modeC_conv_ci_cpu, run with fpmath mode set to bf16, fails on the AArch64 GitHub CI on an AWS c7g instance.

Version

ACL_VERSION: v24.11.1

Environment

oneDNN GitHub CI for AArch64 on an AWS c7g instance.

Steps to reproduce

benchdnn --conv --dir=FWD_D --attr-fpmath=bf16

Observed behaviour

The test fails with NaN values in the destination tensor:

2024-12-19T17:06:18.6979025Z run: --conv --dir=FWD_D --attr-fpmath=bf16 ic17ih8oc17oh4kh1sh2ph0n"conv_basic_2d:1x1_stride_tail"
2024-12-19T17:06:18.6979225Z [   4][DST][0:0:1:0] exp_f32:         -25 exp:         -25 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6979403Z [  11][DST][0:0:2:3] exp_f32:         -19 exp:         -19 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6979583Z [  14][DST][0:0:3:2] exp_f32:         -29 exp:         -29 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6979751Z [  15][DST][0:0:3:3] exp_f32:         -16 exp:         -16 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6979922Z [  20][DST][0:1:1:0] exp_f32:          24 exp:          24 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6980092Z [  27][DST][0:1:2:3] exp_f32:         -35 exp:         -35 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6980363Z [  30][DST][0:1:3:2] exp_f32:         -30 exp:         -30 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6980534Z [  31][DST][0:1:3:3] exp_f32:         -26 exp:         -26 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6980713Z [  36][DST][0:2:1:0] exp_f32:         -17 exp:         -17 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6980886Z [  43][DST][0:2:2:3] exp_f32:         -40 exp:         -40 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6981241Z [COMPARE_STATS][DST]: trh=0 err_max_diff:     nan err_max_rdiff:     nan all_max_diff:       0 all_max_rdiff:       0
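
The mismatch lines in the log above all follow one layout (element index, tensor tag, coordinates, then `exp_f32`, `exp` and `got` values). A minimal Python sketch for pulling the NaN entries out of such a log is shown here. The regular expression is inferred from this excerpt only, not from benchdnn documentation, and `parse_mismatches` and `nan_mismatches` are hypothetical helper names.

```python
import math
import re
from typing import List, NamedTuple, Tuple

# Layout inferred from the log excerpt in this issue (an assumption, not a
# documented benchdnn format): [index][TENSOR][c0:c1:...] exp_f32: X exp: Y got: Z
MISMATCH_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\[(?P<tensor>\w+)\]\[(?P<coords>[\d:]+)\]\s+"
    r"exp_f32:\s*(?P<exp_f32>\S+)\s+exp:\s*(?P<exp>\S+)\s+got:\s*(?P<got>\S+)"
)


class Mismatch(NamedTuple):
    index: int                 # flat element index reported in the first bracket
    tensor: str                # tensor tag, e.g. "DST"
    coords: Tuple[int, ...]    # colon-separated coordinates from the third bracket
    expected: float            # value after "exp:"
    got: float                 # value after "got:"


def parse_mismatches(log_text: str) -> List[Mismatch]:
    """Return one Mismatch per log line that matches the inferred layout."""
    rows = []
    for line in log_text.splitlines():
        m = MISMATCH_RE.search(line)
        if m is None:
            continue  # run headers, COMPARE_STATS lines, and so on
        coords = tuple(int(c) for c in m["coords"].split(":"))
        rows.append(
            Mismatch(int(m["index"]), m["tensor"], coords,
                     float(m["exp"]), float(m["got"]))
        )
    return rows


def nan_mismatches(log_text: str) -> List[Mismatch]:
    """Keep only the mismatches where the computed value is NaN."""
    return [row for row in parse_mismatches(log_text) if math.isnan(row.got)]


# Two lines copied from the log in this issue, plus the summary line.
SAMPLE_LOG = """\
2024-12-19T17:06:18.6979225Z [   4][DST][0:0:1:0] exp_f32:         -25 exp:         -25 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6979922Z [  20][DST][0:1:1:0] exp_f32:          24 exp:          24 got:         nan diff:     nan rdiff:     nan
2024-12-19T17:06:18.6981241Z [COMPARE_STATS][DST]: trh=0 err_max_diff:     nan err_max_rdiff:     nan all_max_diff:       0 all_max_rdiff:       0
"""
```

Grouping the resulting coordinates by channel or spatial position may help show whether the NaN values cluster in a particular part of the output.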

Expected behavior

The test passes.

@renato-arantes renato-arantes added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Dec 20, 2024
@theComputeKid theComputeKid added bug A confirmed library bug and removed sighting Suspicious library behavior. Should be promoted to a bug when confirmed labels Dec 20, 2024
@theComputeKid
Member

This is a sporadic issue that shows up in the AArch64 CI now that we have expanded the test set. It goes away if the job is restarted.

@theComputeKid theComputeKid changed the title Test test_benchdnn_modeC_conv_ci_cpu failing on AArch64 GitHub CI for c7g AWS instance. test_benchdnn_modeC_conv_ci_cpu fails on AArch64 CI for c7g instance Dec 20, 2024
@vpirogov vpirogov added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Dec 20, 2024
@vpirogov
Member

This is a sporadic issue that shows up in the AArch64 CI now that we have expanded the test set. It goes away if the job is restarted.

Three times as fun as a stable failure!

@Sqvid
Contributor

Sqvid commented Jan 15, 2025

This is still happening, though far less often since #2357, so I will reopen this issue till we can fix it for good.

@Sqvid Sqvid reopened this Jan 15, 2025
@theComputeKid
Member

This happens for any OMP_NUM_THREADS value from 1 to 64, but most frequently with 16 threads, which is conveniently what we have in CI.
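
Because the failure is sporadic and depends on the thread count, one way to estimate how often it occurs is to repeat the same command under several `OMP_NUM_THREADS` values and count the non-zero exits. The Python sketch below does that. It assumes the command exits with a non-zero status when a case fails; the `./benchdnn` path is hypothetical, and the arguments are copied from the reproduction step and log in this issue.

```python
import os
import subprocess
from typing import Dict, Iterable, List


def count_failures(cmd: List[str], threads: int, repeats: int) -> int:
    """Run cmd `repeats` times with OMP_NUM_THREADS=threads; count non-zero exits."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    failures = 0
    for _ in range(repeats):
        result = subprocess.run(
            cmd, env=env, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


def sweep(cmd: List[str], thread_counts: Iterable[int], repeats: int) -> Dict[int, int]:
    """Map each thread count to the number of failed runs observed for it."""
    return {t: count_failures(cmd, t, repeats) for t in thread_counts}


# Hypothetical binary location; adjust to the local build directory.
# The problem descriptor is passed as a single argument, copied from the log.
BENCHDNN_CMD = [
    "./benchdnn",
    "--conv",
    "--dir=FWD_D",
    "--attr-fpmath=bf16",
    'ic17ih8oc17oh4kh1sh2ph0n"conv_basic_2d:1x1_stride_tail"',
]

# Example invocation (not executed here):
#   for threads, failed in sweep(BENCHDNN_CMD, [1, 8, 16, 32, 64], repeats=50).items():
#       print(f"OMP_NUM_THREADS={threads}: {failed}/50 runs failed")
```

The per-thread-count failure totals give a rough baseline for checking whether a later change actually lowers the failure rate rather than just hiding it.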

@michalowski-arm
Contributor

#2403 (merged) mitigates the issue as a side effect. It moves the jit_conv implementations above acl_conv in the convolution implementation list because they give better performance, and they also happen not to fail in the test above. The underlying problem is therefore not solved yet, but the CI should no longer fail because of this test.
