
CUDA version of settler_cover #746

Closed · wants to merge 13 commits
Conversation

@arlowhite (Collaborator) commented May 2, 2024:

Create a CUDA (Nvidia GPU library) version of settler_cover
#739

TODO

  • fix CUDA scalar indexing error with multiply
  • compare performance
  • try configuring the kernel to see if it improves performance
  • decide how to enable CUDA

Draft Status

The CUDA implementation is in a temporary sandbox project directory, test_performance. We need to decide how to organize the CUDA code before merging this PR. There's also the question of whether CUDA should be used automatically when available, or enabled via an ADRIA config flag.

CUDA ext

Probably a Julia extension like VizExt; settler_growth_cuda will be moved there.

If Julia allows dynamic imports, I'm thinking something like:

settler_cover = config.enable_cuda
  ? using AdriaCuda; settler_cover_cuda
  : settler_cover

where AdriaCuda would be an ext.
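Julia 1.9+ can look up a loaded package extension at runtime via Base.get_extension, so a runnable version of this idea might look like the following sketch (select_settler_cover is a hypothetical helper, and it assumes the AdriaCuda extension has been loaded, i.e. CUDA.jl is in the environment):

    # Sketch: choose the implementation once, based on config.
    function select_settler_cover(enable_cuda::Bool)
        if enable_cuda
            ext = Base.get_extension(ADRIA, :AdriaCuda)
            ext === nothing &&
                error("AdriaCuda extension not loaded; add CUDA.jl to the environment")
            return ext.settler_cover_cuda
        end
        return settler_cover
    end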

CUDA in base pkg

Alternatively, we just add CUDA to the base package and follow this approach:
https://cuda.juliagpu.org/stable/installation/conditional/#Scenario-2:-GPU-is-optional

using CUDA

const use_gpu = Ref(false)

# Move data to the GPU only when one is available.
to_gpu_or_not_to_gpu(x::AbstractArray) = use_gpu[] ? CuArray(x) : x

function __init__()
    use_gpu[] = CUDA.functional()
end
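A call site would then opt in transparently (fec_scope_dev is a hypothetical name):

    # CuArray when a functional GPU was detected at init, else the host array.
    fec_scope_dev = to_gpu_or_not_to_gpu(fec_scope)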

Blocking CUDA Error

    @views potential_settlers[:, valid_sinks] .= (
        fec_scope[:, valid_sources] * conn[valid_sources, valid_sinks]
    )
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] errorscalar(op::String)
    @ GPUArraysCore C:\Users\awhite\.julia\packages\GPUArraysCore\GMsgk\src\GPUArraysCore.jl:155
  [3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
    @ GPUArraysCore C:\Users\awhite\.julia\packages\GPUArraysCore\GMsgk\src\GPUArraysCore.jl:128
  [4] assertscalar(op::String)
    @ GPUArraysCore C:\Users\awhite\.julia\packages\GPUArraysCore\GMsgk\src\GPUArraysCore.jl:116
  [5] getindex
    @ C:\Users\awhite\.julia\packages\GPUArrays\OKkAu\src\host\indexing.jl:48 [inlined]
  [6] reindex
    @ .\subarray.jl:268 [inlined]
  [7] reindex
    @ .\subarray.jl:264 [inlined]
  [8] getindex
    @ .\subarray.jl:290 [inlined]
  [9] _generic_matmatmul!(C::CuArray{…}, tA::Char, tB::Char, A::SubArray{…}, B::SubArray{…}, _add::LinearAlgebra.MulAddMul{…})
    @ LinearAlgebra C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:814
 [10] generic_matmatmul!(C::CuArray{…}, tA::Char, tB::Char, A::SubArray{…}, B::SubArray{…}, _add::LinearAlgebra.MulAddMul{…})
    @ LinearAlgebra C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:783
 [11] mul!
    @ C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:263 [inlined]
 [12] mul!
    @ C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:237 [inlined]
 [13] *
    @ C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:106 [inlined]
 [14] _settler_cover(fec_scope::CuArray{…}, conn::CuArray{…}, leftover_space::CuArray{…}, α::CuArray{…}, β::CuArray{…}, basal_area_per_settler::CuArray{…}, potential_settlers::CuArray{…}, valid_sources::CuArray{…}, valid_sinks::CuArray{…})
    @ ADRIA c:\Users\awhite\code\ADRIA.jl\src\ecosystem\corals\growth.jl:718
 [15] settler_cover_cuda(fec_scope::Matrix{…}, conn::Matrix{…}, leftover_space::Matrix{…}, α::Vector{…}, β::Vector{…}, basal_area_per_settler::Vector{…}, potential_settlers::Matrix{…})

See https://cuda.juliagpu.org/stable/usage/workflow/#UsageWorkflowScalar

From https://cuda.juliagpu.org/stable/usage/array/#Array-wrappers:

Certain common operations, like broadcast or matrix multiplication, do know how to deal with array wrappers by using the Adapt.jl package. This is still not a complete solution though, e.g. new array wrappers are not covered, and only one level of wrapping is supported. Sometimes the only solution is to materialize the wrapper to a CuArray again.
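As a concrete illustration of "materialize the wrapper to a CuArray again" for the failing line above (a sketch, not the committed fix): plain getindex on a CuArray executes on the device and returns a new dense CuArray, so dropping the @views keeps SubArray wrappers out of the matmul.

    # Plain indexing (no @views) materializes dense CuArrays on the device,
    # so * dispatches to CUBLAS instead of the scalar-indexing fallback.
    F = fec_scope[:, valid_sources]        # new CuArray
    C = conn[valid_sources, valid_sinks]   # new CuArray
    potential_settlers[:, valid_sinks] .= F * C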

Notes

pkg> add CUDA takes a long time when it detects an Nvidia GPU, because it downloads multiple runtime artifacts. You can configure CUDA.jl to use a local system toolkit instead and skip the downloads:
https://cuda.juliagpu.org/stable/installation/overview/#Using-a-local-CUDA

On HPC g001, I use a different Julia depot path to get around this, but CUDA.set_runtime_version!(v"11.8"; local_toolkit=true) could possibly work as well.
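For reference, pinning CUDA.jl to the local toolkit looks like this (the version is an assumption for this machine, and the setting takes effect after restarting Julia):

    using CUDA

    # Use the system-installed toolkit (assumed 11.8 here) instead of
    # downloading CUDA runtime artifacts.
    CUDA.set_runtime_version!(v"11.8"; local_toolkit=true)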

@arlowhite (Collaborator, Author) commented May 2, 2024:

Removing @views fixes the multiply error. Then it's not happy with replace in recruitment_rate:
ERROR: Scalar indexing is disallowed.

  [5] getindex(A::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, I::Int64)
    @ GPUArrays C:\Users\awhite\.julia\packages\GPUArrays\OKkAu\src\host\indexing.jl:48
  [6] _replace!(new::Base.var"#new#395"{…}, res::CuArray{…}, A::CuArray{…}, count::Int64)
    @ Base .\set.jl:802
  [7] replace_pairs!
    @ .\set.jl:616 [inlined]
  [8] #replace#397
    @ .\set.jl:690 [inlined]
  [9] replace
    @ .\set.jl:686 [inlined]
 [10] recruitment_rate(larval_pool::CuArray{…}, A::CuArray{…}; α::CuArray{…}, β::CuArray{…})
    @ ADRIA c:\Users\awhite\code\ADRIA.jl\src\ecosystem\corals\growth.jl:658
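Base's replace falls back to scalar iteration on CuArrays. A broadcast-based equivalent that stays on the device might look like this sketch (assuming, hypothetically, that the call replaces NaN with 0.0):

    # Elementwise equivalent of replace(A, NaN => 0.0) that runs on the GPU.
    A .= ifelse.(isnan.(A), 0.0, A)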

@ConnectedSystems (Collaborator) commented:

I wonder at what point it stops being worth the overhead of transferring data to the GPU.

For example, with a small domain of < 100 locations, or when conditions leave few locations to evaluate (e.g., small numbers of valid_sources and valid_sinks), does performance degrade in these cases, or is it always faster to run on the GPU?

@arlowhite (Collaborator, Author) commented:

Ideally, all of settler_cover_cuda would run on the GPU, and then I think it would always be worth it. But I suspect memory is shifting back and forth too frequently the way it's coded now. I did try @cuda to compile the function as a kernel, but that didn't work, and I don't fully understand how it works.
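For context, @cuda compiles a restricted Julia function into a device kernel in which each thread computes its own index. A minimal, generic example (not the settler_cover code):

    using CUDA

    # Each thread scales one element of y in place.
    function scale_kernel!(y, a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] *= a
        end
        return nothing
    end

    y = CUDA.rand(1024)
    @cuda threads=256 blocks=cld(length(y), 256) scale_kernel!(y, 2f0)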

Working example of CUDA-fied `settler_cover()`
@ConnectedSystems (Collaborator) commented:

> But I suspect memory is shifting back and forth too frequently the way it's coded now

Yep, definitely too much data transfer happening, but now we have a first-pass working example.

There are ways around this:
https://cuda.juliagpu.org/stable/lib/driver/#CUDA.Mem.UnifiedBuffer
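Unified (managed) buffers let host and device share a single allocation, with pages migrating on demand instead of being copied explicitly. A minimal sketch using the cu keyword (see also arlowhite's note below that this turned out much slower here):

    using CUDA

    # Allocate in unified memory: both CPU and GPU can touch x directly.
    x = cu(rand(Float32, 1024); unified=true)
    x .+= 1f0   # executes on the GPU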

@arlowhite (Collaborator, Author) commented:

Float32 is about twice as fast as Float64, so the GPU goes from ~9x to ~17x faster than the CPU.

In the last few commits, I went back to using this original code. @ConnectedSystems' impl (now settler_cover_cuda2) is a bit faster, but I don't understand how it's equivalent.

    potential_settlers[:, valid_sinks] .= (
        fec_scope[:, valid_sources] * conn[valid_sources, valid_sinks]
    )

I tried cu(x, unified=true) but it was much slower.

@arlowhite (Collaborator, Author) commented:

Also, can't we cache valid_sources and valid_sinks generally? The speedup is significant.

@ConnectedSystems (Collaborator) commented:

> Also, can't we cache valid_sources and valid_sinks generally? The speedup is significant.

We can't cache them, as they may change from time step to time step, but we can preallocate and reuse the same vectors. That's what I was referring to in the comments.
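A sketch of that preallocate-and-refill pattern (names and validity criteria are hypothetical here):

    # Allocate the masks once, outside the time loop...
    valid_sources = BitVector(undef, n_locs)
    valid_sinks   = BitVector(undef, n_locs)

    for t in 1:n_timesteps
        # ...then refill them in place each step instead of reallocating.
        valid_sources .= vec(sum(fec_scope; dims=1) .> 0.0)
        valid_sinks   .= vec(leftover_space .> 0.0)
        # ... use the masks ...
    end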

(Inline review comment on these lines:)

    # Calculate settler cover and copy result back to host
    # This matrix multiplication is the most time-consuming part
    # (`recruitment_rate()` takes < 1ms)
    # FIXME reversed? dest param is first.
@ConnectedSystems (Collaborator) commented May 3, 2024:

Bah, I think you're right; I accidentally switched it.

@arlowhite (Collaborator, Author) commented:

Closing this for now. I don't think it's worth spending much time on CUDA until more GPU compute becomes available. At the moment, the strategy is that model runs should be done on the AIMS HPC, but we only have one GPU node.

@arlowhite closed this Aug 14, 2024
@ConnectedSystems (Collaborator) commented:

I thought I commented, but it seems to have disappeared.

The team has access to a dedicated remote desktop with a GPU. Someone else is using it, which is why I asked you to prototype on the HPC.

@arlowhite (Collaborator, Author) commented Aug 14, 2024:

You commented on #747.

I did work on both the HPC and the remote desktop. I usually close PRs if they're not merging soon and re-open them later when work resumes, but if you prefer to keep it open, we can. At the moment, I'm focused on other tasks.
