
CUDA version of settler_cover #746

Closed · wants to merge 13 commits
Conversation

@arlowhite (Collaborator) commented May 2, 2024:

Create a CUDA (Nvidia GPU library) version of settler_cover
#739

TODO

  • fix CUDA scalar indexing error with multiply
  • compare performance
  • try configuring the kernel to see if it improves performance
  • decide how to enable CUDA

Draft Status

The CUDA implementation is in a temporary sandbox project directory, test_performance. We need to decide how to organize the CUDA code before merging this PR. There's also the question of whether CUDA should be used automatically when available, or enabled via an ADRIA config flag.

CUDA ext

Probably a Julia extension like VizExt; settler_growth_cuda will be moved there.

If Julia allows dynamic imports, I'm thinking something like:

settler_cover = config.enable_cuda
  ? using AdriaCuda; settler_cover_cuda
  : settler_cover

where AdriaCuda would be an ext.
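Julia 1.9+ can look up a loaded package extension at runtime via Base.get_extension, so a runnable version of this idea might look like the following sketch (select_settler_cover is a hypothetical helper, and it assumes the AdriaCuda extension has been loaded, i.e. CUDA.jl is in the environment):

    # Sketch: choose the implementation once, based on config.
    function select_settler_cover(enable_cuda::Bool)
        if enable_cuda
            ext = Base.get_extension(ADRIA, :AdriaCuda)
            ext === nothing &&
                error("AdriaCuda extension not loaded; add CUDA.jl to the environment")
            return ext.settler_cover_cuda
        end
        return settler_cover
    end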

CUDA in base pkg

Alternatively, we just add CUDA to the base package and follow this approach:
https://cuda.juliagpu.org/stable/installation/conditional/#Scenario-2:-GPU-is-optional

using CUDA

const use_gpu = Ref(false)

# Move data to the GPU only when one is available.
to_gpu_or_not_to_gpu(x::AbstractArray) = use_gpu[] ? CuArray(x) : x

function __init__()
    use_gpu[] = CUDA.functional()
end
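A call site would then opt in transparently (fec_scope_dev is a hypothetical name):

    # CuArray when a functional GPU was detected at init, else the host array.
    fec_scope_dev = to_gpu_or_not_to_gpu(fec_scope)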

Blocking CUDA Error

    @views potential_settlers[:, valid_sinks] .= (
        fec_scope[:, valid_sources] * conn[valid_sources, valid_sinks]
    )
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] errorscalar(op::String)
    @ GPUArraysCore C:\Users\awhite\.julia\packages\GPUArraysCore\GMsgk\src\GPUArraysCore.jl:155
  [3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
    @ GPUArraysCore C:\Users\awhite\.julia\packages\GPUArraysCore\GMsgk\src\GPUArraysCore.jl:128
  [4] assertscalar(op::String)
    @ GPUArraysCore C:\Users\awhite\.julia\packages\GPUArraysCore\GMsgk\src\GPUArraysCore.jl:116
  [5] getindex
    @ C:\Users\awhite\.julia\packages\GPUArrays\OKkAu\src\host\indexing.jl:48 [inlined]
  [6] reindex
    @ .\subarray.jl:268 [inlined]
  [7] reindex
    @ .\subarray.jl:264 [inlined]
  [8] getindex
    @ .\subarray.jl:290 [inlined]
  [9] _generic_matmatmul!(C::CuArray{…}, tA::Char, tB::Char, A::SubArray{…}, B::SubArray{…}, _add::LinearAlgebra.MulAddMul{…})
    @ LinearAlgebra C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:814
 [10] generic_matmatmul!(C::CuArray{…}, tA::Char, tB::Char, A::SubArray{…}, B::SubArray{…}, _add::LinearAlgebra.MulAddMul{…})
    @ LinearAlgebra C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:783
 [11] mul!
    @ C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:263 [inlined]
 [12] mul!
    @ C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:237 [inlined]
 [13] *
    @ C:\Users\awhite\.julia\juliaup\julia-1.10.3+0.x64.w64.mingw32\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:106 [inlined]
 [14] _settler_cover(fec_scope::CuArray{…}, conn::CuArray{…}, leftover_space::CuArray{…}, α::CuArray{…}, β::CuArray{…}, basal_area_per_settler::CuArray{…}, potential_settlers::CuArray{…}, valid_sources::CuArray{…}, valid_sinks::CuArray{…})
    @ ADRIA c:\Users\awhite\code\ADRIA.jl\src\ecosystem\corals\growth.jl:718
 [15] settler_cover_cuda(fec_scope::Matrix{…}, conn::Matrix{…}, leftover_space::Matrix{…}, α::Vector{…}, β::Vector{…}, basal_area_per_settler::Vector{…}, potential_settlers::Matrix{…})

See https://cuda.juliagpu.org/stable/usage/workflow/#UsageWorkflowScalar

From https://cuda.juliagpu.org/stable/usage/array/#Array-wrappers:

Certain common operations, like broadcast or matrix multiplication, do know how to deal with array wrappers by using the Adapt.jl package. This is still not a complete solution though, e.g. new array wrappers are not covered, and only one level of wrapping is supported. Sometimes the only solution is to materialize the wrapper to a CuArray again.
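As a concrete illustration of "materialize the wrapper to a CuArray again" for the failing line above (a sketch, not the committed fix): plain getindex on a CuArray executes on the device and returns a new dense CuArray, so dropping the @views keeps SubArray wrappers out of the matmul.

    # Plain indexing (no @views) materializes dense CuArrays on the device,
    # so * dispatches to CUBLAS instead of the scalar-indexing fallback.
    F = fec_scope[:, valid_sources]        # new CuArray
    C = conn[valid_sources, valid_sinks]   # new CuArray
    potential_settlers[:, valid_sinks] .= F * C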

Notes

pkg> add CUDA takes a long time when it detects an Nvidia GPU, because it downloads multiple runtime artifacts. You can configure CUDA.jl to use a local system toolkit instead and skip the downloads:
https://cuda.juliagpu.org/stable/installation/overview/#Using-a-local-CUDA

On HPC g001, I use a different Julia depot path to get around this, but CUDA.set_runtime_version!(v"11.8"; local_toolkit=true) could possibly work as well.
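For reference, pinning CUDA.jl to the local toolkit looks like this (the version is an assumption for this machine, and the setting takes effect after restarting Julia):

    using CUDA

    # Use the system-installed toolkit (assumed 11.8 here) instead of
    # downloading CUDA runtime artifacts.
    CUDA.set_runtime_version!(v"11.8"; local_toolkit=true)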

@arlowhite (Collaborator, Author) commented May 2, 2024:

Removing @views fixes the multiply error. Then it's not happy with replace in recruitment_rate:
ERROR: Scalar indexing is disallowed.

  [5] getindex(A::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, I::Int64)
    @ GPUArrays C:\Users\awhite\.julia\packages\GPUArrays\OKkAu\src\host\indexing.jl:48
  [6] _replace!(new::Base.var"#new#395"{…}, res::CuArray{…}, A::CuArray{…}, count::Int64)
    @ Base .\set.jl:802
  [7] replace_pairs!
    @ .\set.jl:616 [inlined]
  [8] #replace#397
    @ .\set.jl:690 [inlined]
  [9] replace
    @ .\set.jl:686 [inlined]
 [10] recruitment_rate(larval_pool::CuArray{…}, A::CuArray{…}; α::CuArray{…}, β::CuArray{…})
    @ ADRIA c:\Users\awhite\code\ADRIA.jl\src\ecosystem\corals\growth.jl:658
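Base's replace falls back to scalar iteration on CuArrays. A broadcast-based equivalent that stays on the device might look like this sketch (assuming, hypothetically, that the call replaces NaN with 0.0):

    # Elementwise equivalent of replace(A, NaN => 0.0) that runs on the GPU.
    A .= ifelse.(isnan.(A), 0.0, A)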

@ConnectedSystems (Collaborator) commented:

I wonder at what point it stops being worth the overhead of transferring data to the GPU.

For example, with a small domain of < 100 locations, or when conditions leave few locations to evaluate (e.g., small numbers of valid_sources and valid_sinks), does performance degrade in these cases, or is it always faster to run on the GPU?

@arlowhite (Collaborator, Author) commented:

Ideally, all of settler_cover_cuda would run on the GPU, and then I think it would always be worth it. But I suspect memory is shifting back and forth too frequently the way it's coded now. I did try @cuda to compile the function as a kernel, but that didn't work, and I don't fully understand how it works.
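For context, @cuda compiles a restricted Julia function into a device kernel in which each thread computes its own index. A minimal, generic example (not the settler_cover code):

    using CUDA

    # Each thread scales one element of y in place.
    function scale_kernel!(y, a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] *= a
        end
        return nothing
    end

    y = CUDA.rand(1024)
    @cuda threads=256 blocks=cld(length(y), 256) scale_kernel!(y, 2f0)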

Working example of CUDA-fied `settler_cover()`
@ConnectedSystems (Collaborator) commented:

> But I suspect memory is shifting back and forth too frequently the way it's coded now

Yep, definitely too much data transfer happening, but now we have a first-pass working example.

There are ways around this:
https://cuda.juliagpu.org/stable/lib/driver/#CUDA.Mem.UnifiedBuffer
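Unified (managed) buffers let host and device share a single allocation, with pages migrating on demand instead of being copied explicitly. A minimal sketch using the cu keyword (see also arlowhite's note below that this turned out much slower here):

    using CUDA

    # Allocate in unified memory: both CPU and GPU can touch x directly.
    x = cu(rand(Float32, 1024); unified=true)
    x .+= 1f0   # executes on the GPU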

@arlowhite (Collaborator, Author) commented:

Float32 is about twice as fast as Float64, so the GPU goes from ~9x to ~17x faster than the CPU.

In the last few commits, I went back to using this original code. @ConnectedSystems' impl (now settler_cover_cuda2) is a bit faster, but I don't understand how it's equivalent.

    potential_settlers[:, valid_sinks] .= (
        fec_scope[:, valid_sources] * conn[valid_sources, valid_sinks]
    )

I tried cu(x, unified=true) but it was much slower.

@arlowhite (Collaborator, Author) commented:

Also, can't we cache valid_sources and valid_sinks generally? The speedup is significant.

@ConnectedSystems (Collaborator) commented:

> Also, can't we cache valid_sources and valid_sinks generally? The speedup is significant.

We can't cache them, as they may change from time step to time step, but we can preallocate and reuse the same vectors. That's what I was referring to in the comments.
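A sketch of that preallocate-and-refill pattern (names and validity criteria are hypothetical here):

    # Allocate the masks once, outside the time loop...
    valid_sources = BitVector(undef, n_locs)
    valid_sinks   = BitVector(undef, n_locs)

    for t in 1:n_timesteps
        # ...then refill them in place each step instead of reallocating.
        valid_sources .= vec(sum(fec_scope; dims=1) .> 0.0)
        valid_sinks   .= vec(leftover_space .> 0.0)
        # ... use the masks ...
    end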

(Inline review comment on these lines:)

    # Calculate settler cover and copy result back to host
    # This matrix multiplication is the most time-consuming part
    # (`recruitment_rate()` takes < 1ms)
    # FIXME reversed? dest param is first.
@ConnectedSystems (Collaborator) commented May 3, 2024:

Bah, I think you're right; I accidentally switched it.

@arlowhite (Collaborator, Author) commented:

Closing this for now. I don't think it's worth spending much time on CUDA until more GPU compute becomes available. At the moment, the strategy is that model runs should be done on the AIMS HPC, but we only have one GPU node.

@arlowhite closed this Aug 14, 2024
@ConnectedSystems (Collaborator) commented:

I thought I commented, but it seems to have disappeared.

The team has access to a dedicated remote desktop with a GPU. Someone else is using it, which is why I asked you to prototype on the HPC.

@arlowhite (Collaborator, Author) commented Aug 14, 2024:

You commented on #747.

I did work on both the HPC and the remote desktop. I usually close PRs if they're not merging soon and re-open them later when work resumes, but if you prefer to keep it open, we can. At the moment, I'm focused on other tasks.
