Add CI job on self-hosted runner #1495

Closed
sloede wants to merge 13 commits into main from msl/add-job-on-self-hosted-runner


sloede (Member) commented May 28, 2023

This PR is created to test running individual jobs on a self-hosted GitHub runner.

The newly added jobs have one or more of the following names:

  • mpi - self-hosted - x64 - pull_request
  • mpi - 2core-8gib - x64 - pull_request
  • mpi - 4core-8gib - x64 - pull_request
  • mpi - 8core-16gib - x64 - pull_request
  • mpi - 16core-32gib - x64 - pull_request

@sloede closed this May 28, 2023
@sloede reopened this May 28, 2023

codecov bot commented May 28, 2023

Codecov Report

Merging #1495 (9543a3b) into main (6bb298d) will decrease coverage by 0.17%.
The diff coverage is n/a.

❗ Current head 9543a3b differs from pull request most recent head 4b38b0a. Consider uploading reports for the commit 4b38b0a to get more accurate results

@@            Coverage Diff             @@
##             main    #1495      +/-   ##
==========================================
- Coverage   96.09%   95.92%   -0.17%     
==========================================
  Files         363      363              
  Lines       30183    30183              
==========================================
- Hits        29003    28951      -52     
- Misses       1180     1232      +52     
Flag Coverage Δ
unittests 95.92% <ø> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 4 files with indirect coverage changes

@sloede closed this May 28, 2023
@sloede reopened this May 28, 2023

ranocha (Member) commented May 28, 2023

The largest runner fails, but at least with a different error message:

3 dependencies errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the packages
     Testing Running tests...
ERROR: LoadError: ERROR: LoadError: ERROR: LoadError: InexactError: check_top_bit(InexactError: check_top_bit(UInt64, -2093057)
Stacktrace:
  [1] throw_inexacterror(f::Symbol, #unused#::Type{UInt64}, val::Int64)
    @ Core ./boot.jl:614
  [2] check_top_bit
    @ ./boot.jl:628 [inlined]
  [3] toUInt64
    @ ./boot.jl:739 [inlined]
  [4] UInt64
    @ ./boot.jl:769 [inlined]
  [5] convert
    @ ./number.jl:7 [inlined]
  [6] cconvert
    @ ./essentials.jl:412 [inlined]
  [7] malloc
    @ ./libc.jl:355 [inlined]
  [8] valloc
    @ ~/.julia/packages/VectorizationBase/0dXyA/src/alignment.jl:36 [inlined]
  [9] init_bcache
    @ ~/.julia/packages/Octavian/XhL0C/src/init.jl:19 [inlined]
 [10] __init__()
    @ Octavian ~/.julia/packages/Octavian/XhL0C/src/init.jl:3
 [11] macro expansion
    @ ~/.julia/packages/Octavian/XhL0C/src/Octavian.jl:80 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/SnoopPrecompile/1XXT1/src/SnoopPrecompile.jl:119 [inlined]
 [13] top-level scope
    @ ~/.julia/packages/Octavian/XhL0C/src/Octavian.jl:77
 [14] include
    @ ./Base.jl:419 [inlined]
 [15] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::String)
    @ Base ./loading.jl:1554
 [16] top-level scope
    @ stdin:1
in expression starting at /root/.julia/packages/Octavian/XhL0C/src/Octavian.jl:1
in expression starting at stdin:1
UInt64, -2093057)

https://github.com/trixi-framework/Trixi.jl/actions/runs/5103870033/jobs/9174408300?pr=1495#step:7:899
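
Reading the stack trace, frames [5] to [7] suggest that a negative `Int64` reached `malloc` as the allocation size, where it is converted to an unsigned size type (`UInt64` here). The following is a minimal sketch of that conversion failure in isolation, reusing the value from the log. The helper name `try_unsigned_size` is made up for illustration, and the sketch does not explain why the size came out negative in the first place.

```julia
# Hypothetical helper (not part of Octavian.jl or Trixi.jl): attempt the
# signed-to-unsigned conversion that happens on the size argument of
# `malloc` (frames [5]-[7] in the trace above).
function try_unsigned_size(nbytes::Integer)
    try
        return convert(UInt64, nbytes)
    catch err
        # Only swallow the conversion error; anything else is unexpected.
        err isa InexactError || rethrow()
        return nothing  # a negative size cannot be represented as UInt64
    end
end

try_unsigned_size(-2093057)  # value taken from the log above
try_unsigned_size(4096)      # a valid, non-negative size converts fine
```

Under this reading, the `InexactError` is a symptom: the underlying problem is whatever produced a negative buffer size before the allocation call.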

ranocha (Member) commented May 28, 2023

Would it make sense to limit CI runners to only 2 MPI ranks in

const TRIXI_MPI_NPROCS = clamp(Sys.CPU_THREADS, 2, 3)

?
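
One way this suggestion could look is sketched below. It assumes that CI runs can be detected via the `CI` environment variable (which GitHub Actions sets), and the function name `mpi_nprocs` and its keyword arguments are hypothetical and only exist to make the sketch testable. It is not necessarily what the commit tried later in this thread does.

```julia
# Hypothetical helper: choose the number of MPI ranks for the test run.
# Assumption: a CI run is detected via the `CI` environment variable.
function mpi_nprocs(; cpu_threads::Integer = Sys.CPU_THREADS,
                    on_ci::Bool = haskey(ENV, "CI"))
    # On CI, always use the minimum of 2 ranks to keep memory usage low.
    on_ci && return 2
    # Elsewhere, keep the existing behavior: between 2 and 3 ranks.
    return clamp(cpu_threads, 2, 3)
end

const TRIXI_MPI_NPROCS = mpi_nprocs()
```

Since each MPI rank is a separate Julia process, going from 3 ranks to 2 removes one full process worth of memory from the runner.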

ranocha (Member) commented May 28, 2023

Would it make sense to see whether we can nudge the GC by adding something like --heap-size-hint=1G to the argument list of julia in

mpiexec() do cmd
    run(`$cmd -n $TRIXI_MPI_NPROCS $(Base.julia_cmd()) --threads=1 --check-bounds=yes $(abspath("test_mpi.jl"))`)
end

for Julia v1.9 and newer?
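
A minimal sketch of how the flag could be added only on Julia v1.9 and newer, where `--heap-size-hint` is available. The helper name `mpi_julia_flags` is hypothetical; the other flags and the `test_mpi.jl` path are taken from the snippet above.

```julia
# Hypothetical helper: build the list of julia flags for the MPI test run.
function mpi_julia_flags(version::VersionNumber = VERSION)
    flags = ["--threads=1", "--check-bounds=yes"]
    # `--heap-size-hint` only exists in Julia v1.9 and newer.
    if version >= v"1.9"
        push!(flags, "--heap-size-hint=1G")
    end
    return flags
end

# Interpolating a vector into a command splices each element in as a
# separate argument. The command is only constructed here, not run.
flags = mpi_julia_flags()
cmd = `$(Base.julia_cmd()) $flags $(abspath("test_mpi.jl"))`
```

The resulting `cmd` could then be passed to `mpiexec` in place of the hard-coded flag list.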

sloede (Member, Author) commented May 29, 2023

> The largest runner fails, but at least with a different error message
>
> [...]

This seems to be a known issue that was recently uncovered (I assume because of recent optimizations in Octavian.jl that rely even more on specific hardware details): JuliaLinearAlgebra/Octavian.jl#177

sloede (Member, Author) commented May 29, 2023

> Would it make sense to see whether we can nudge the GC by adding something like --heap-size-hint=1G to the argument list of julia in

At least in 1.9.0 I did not see much improvement there, unfortunately. When I did my memory usage tests with different Julia versions, I sometimes used --heap-size-hint=1M, which had little effect on the max. RSS 😕 But maybe we can try again with v1.9.1
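
For anyone who wants to repeat such a comparison, one simple way to read the peak memory usage from inside a Julia process is `Sys.maxrss()`, which returns the maximum resident set size in bytes. This is only a sketch; it is not necessarily how the measurements mentioned above were taken, and the helper name `maxrss_mib` is made up.

```julia
# Hypothetical helper: peak resident set size of the current process in MiB.
maxrss_mib() = Sys.maxrss() / 2^20

# Print the value, e.g. at the very end of a test script.
println("max. RSS: ", round(maxrss_mib(); digits = 1), " MiB")
```

Running the same script with and without `--heap-size-hint` and comparing the printed values gives a rough picture of whether the hint has any effect.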

sloede (Member, Author) commented May 29, 2023

> Would it make sense to limit CI runners to only 2 MPI ranks [...] ?

Good idea. Let's see what happens for 7c6159d.

@sloede closed this May 29, 2023
@sloede reopened this May 29, 2023
@sloede added the testing label May 30, 2023
@sloede closed this May 30, 2023
@sloede reopened this May 30, 2023

sloede (Member, Author) commented May 30, 2023

Look at this... With the new reduced-memory setup, the MPI tests also pass on the 4core-8gib machine 🥳

sloede (Member, Author) commented Jun 2, 2023

JuliaLinearAlgebra/Octavian.jl#177 strikes again 😢

sloede (Member, Author) commented Jun 4, 2023

I have concluded the testing phase of self-hosted runners on cloud or on-prem hardware. This PR is thus obsolete; we now need to decide how to proceed with the additional testing.

@sloede closed this Jun 4, 2023
@sloede deleted the msl/add-job-on-self-hosted-runner branch June 4, 2023 17:24