
Try fast linear indexes for KA #2612

Draft · wants to merge 1 commit into master from vc/fast_linear_global

Conversation

@vchuravy (Member) commented Jan 9, 2025

I am unsure why we couldn't do this from the beginning.

@maleadt (Member) commented Jan 9, 2025

Will have to wait for #2593 to get merged.

@github-actions (bot) left a comment


CUDA.jl Benchmarks

| Benchmark suite | Current: c23ab7f | Previous: 14ae82d | Ratio |
|---|---|---|---|
| latency/precompile | 45424277593.5 ns | 45345622059 ns | 1.00 |
| latency/ttfp | 6384836106.5 ns | 6434638936 ns | 0.99 |
| latency/import | 3034614553 ns | 3051828695.5 ns | 0.99 |
| integration/volumerhs | 9567744 ns | 9568259 ns | 1.00 |
| integration/byval/slices=1 | 146777 ns | 146590 ns | 1.00 |
| integration/byval/slices=3 | 425370 ns | 425823 ns | 1.00 |
| integration/byval/reference | 144803 ns | 144766 ns | 1.00 |
| integration/byval/slices=2 | 286177 ns | 286423 ns | 1.00 |
| integration/cudadevrt | 103434 ns | 103488.5 ns | 1.00 |
| kernel/indexing | 13938 ns | 14282.5 ns | 0.98 |
| kernel/indexing_checked | 15060 ns | 15333 ns | 0.98 |
| kernel/occupancy | 701.1063829787234 ns | 720.4492753623189 ns | 0.97 |
| kernel/launch | 2124.1111111111113 ns | 2130.5 ns | 1.00 |
| kernel/rand | 16334 ns | 17397 ns | 0.94 |
| array/reverse/1d | 19520 ns | 19471 ns | 1.00 |
| array/reverse/2d | 24603 ns | 24536 ns | 1.00 |
| array/reverse/1d_inplace | 10031.666666666666 ns | 10836.333333333334 ns | 0.93 |
| array/reverse/2d_inplace | 11528 ns | 11284 ns | 1.02 |
| array/copy | 20270 ns | 20310 ns | 1.00 |
| array/iteration/findall/int | 159097 ns | 158042 ns | 1.01 |
| array/iteration/findall/bool | 139369 ns | 138224 ns | 1.01 |
| array/iteration/findfirst/int | 153853 ns | 154038.5 ns | 1.00 |
| array/iteration/findfirst/bool | 154627.5 ns | 155126 ns | 1.00 |
| array/iteration/scalar | 75657 ns | 76714 ns | 0.99 |
| array/iteration/logical | 207799 ns | 214056.5 ns | 0.97 |
| array/iteration/findmin/1d | 41128 ns | 41628 ns | 0.99 |
| array/iteration/findmin/2d | 94766 ns | 94463 ns | 1.00 |
| array/reductions/reduce/1d | 38659 ns | 51305 ns | 0.75 |
| array/reductions/reduce/2d | 44155.5 ns | 42302 ns | 1.04 |
| array/reductions/mapreduce/1d | 37246.5 ns | 44898.5 ns | 0.83 |
| array/reductions/mapreduce/2d | 51913.5 ns | 52966.5 ns | 0.98 |
| array/broadcast | 21698 ns | 21607 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11663 ns | 13399 ns | 0.87 |
| array/copyto!/cpu_to_gpu | 213248 ns | 213579.5 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 246883 ns | 245985.5 ns | 1.00 |
| array/accumulate/1d | 108538 ns | 109003 ns | 1.00 |
| array/accumulate/2d | 79961 ns | 79807 ns | 1.00 |
| array/construct | 1197.35 ns | 1147.9 ns | 1.04 |
| array/random/randn/Float32 | 43009 ns | 43138 ns | 1.00 |
| array/random/randn!/Float32 | 26240 ns | 26215 ns | 1.00 |
| array/random/rand!/Int64 | 27084 ns | 27096 ns | 1.00 |
| array/random/rand!/Float32 | 8824.666666666666 ns | 8869.333333333334 ns | 0.99 |
| array/random/rand/Int64 | 29684 ns | 29884 ns | 0.99 |
| array/random/rand/Float32 | 12772 ns | 12925 ns | 0.99 |
| array/permutedims/4d | 65015 ns | 67255 ns | 0.97 |
| array/permutedims/2d | 56278 ns | 56783 ns | 0.99 |
| array/permutedims/3d | 60503.5 ns | 58969.5 ns | 1.03 |
| array/sorting/1d | 2920400.5 ns | 2933376.5 ns | 1.00 |
| array/sorting/by | 3499981 ns | 3499572.5 ns | 1.00 |
| array/sorting/2d | 1084450 ns | 1084491.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 1027.8 ns | 1039.3 ns | 0.99 |
| cuda/synchronization/stream/nonblocking | 6532.2 ns | 6569.6 ns | 0.99 |
| cuda/synchronization/stream/blocking | 804.8350515463917 ns | 796.7647058823529 ns | 1.01 |
| cuda/synchronization/context/auto | 1166.8 ns | 1224.5 ns | 0.95 |
| cuda/synchronization/context/nonblocking | 6741.4 ns | 6745.4 ns | 1.00 |
| cuda/synchronization/context/blocking | 891.9583333333334 ns | 915.2391304347826 ns | 0.97 |

This comment was automatically generated by a workflow using github-action-benchmark.

@maleadt force-pushed the vc/fast_linear_global branch from c23ab7f to 2a2b844 on January 9, 2025 at 15:35
@maleadt (Member) commented Jan 9, 2025

A handful of tests fail on this PR:

```
julia> CUDA.@sync sum!(CUDA.rand(Float64, (1, 1025, 2)), CUDA.rand(Float64, (1, 1025, 2)))
ERROR: a BoundsError was thrown during kernel execution on thread (513, 1, 1) in block (3, 1, 1).
Out-of-bounds array access
Stacktrace:
 [1] throw_boundserror at /home/tim/Julia/pkg/CUDA/src/device/quirks.jl:15
 [2] multiple call sites at unknown:0

ERROR: KernelException: exception thrown during kernel execution on device NVIDIA RTX 6000 Ada Generation
Stacktrace:
 [1] check_exceptions()
   @ CUDA ~/Julia/pkg/CUDA/src/compiler/exceptions.jl:39
 [2] synchronize(stream::CuStream; blocking::Bool, spin::Bool)
   @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/synchronization.jl:207
 [3] synchronize
   @ ~/Julia/pkg/CUDA/lib/cudadrv/synchronization.jl:194 [inlined]
 [4] macro expansion
   @ ~/Julia/pkg/CUDA/src/utilities.jl:36 [inlined]
 [5] top-level scope
   @ REPL[44]:1
```

Not a catastrophic number of failures though, so probably worth looking into?

```
Test Summary:                                  |  Pass  Error  Broken  Total  Time
  Overall                                      | 25854     15      11  25880
```

@maleadt (Member) commented Jan 9, 2025

> Not [...] catastrophic though

Well, I'm not so sure about that:

```
julia> CUDA.@sync CUDA.ones(Float64, (1, 1025, 2));
ERROR: a BoundsError was thrown during kernel execution on thread (353, 1, 1) in block (4, 1, 1).
Out-of-bounds array access
Stacktrace:
 [1] throw_boundserror at /home/tim/Julia/pkg/CUDA/src/device/quirks.jl:15
 [2] multiple call sites at unknown:0
```

@maleadt (Member) commented Jan 9, 2025

MWE for the bounds error:

```julia
using CUDA, KernelAbstractions

function main()
    A = CuArray{Float64}(undef, (1, 1025, 2))

    @kernel function fill_kernel!(a)
        idx = @index(Global, Linear)
        if idx >= length(a)
            # only print diagnostics for the first out-of-bounds index
            if idx == length(a)+1
                @cushow threadIdx().x blockDim().x blockIdx().x gridDim().x idx
            end
        else
            a[idx] = 0f0
        end
    end

    kernel = fill_kernel!(get_backend(A))
    CUDA.@sync kernel(A; ndrange = size(A))
end
```

The linear index here goes out of bounds for a lot of threads, so I limited it to printing only the first offending one:

```
(threadIdx()).x = 259
(blockDim()).x = 896
(blockIdx()).x = 3
(gridDim()).x = 4
idx = 2051
```

The launch configuration is strange: 4 blocks of 896 threads cover 3584 items, while 3 blocks (covering 2688 items) would have been sufficient for the 2050 elements, no? In any case, it's also strange that this isn't caught by the bounds check I presume @index is doing...
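For reference, a minimal sketch in plain Julia (not CUDA.jl or KA code) of the arithmetic behind the output above, assuming the usual linearization of thread and block indices:

```julia
# Reconstruct the linear global index from the values @cushow printed.
threadidx, blockdim, blockidx = 259, 896, 3
idx = (blockidx - 1) * blockdim + threadidx    # == 2051

nitems = 1 * 1025 * 2                          # length(A) == 2050
@assert idx > nitems                           # hence the BoundsError

# A purely linear launch over the flattened range would need only:
cld(nitems, blockdim)                          # == 3 blocks, not 4
```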

@maleadt (Member) commented Jan 9, 2025

> The launch configuration is strange: 4 blocks of 896 threads cover 3584 items, while 3 blocks (covering 2688 items) would have been sufficient for the 2050 elements, no?

This is how KA's launch configuration determines that:

```
ndrange = (1, 1025, 2)
config.threads = 896
workgroupsize = threads_to_workgroupsize(threads, ndrange) = (1, 896, 1)
KA.blocks(iterspace) = CartesianIndices((1, 2, 2))
KA.workitems(iterspace) = CartesianIndices((1, 896, 1))
blocks = length(KA.blocks(iterspace)) = 4
threads = length(KA.workitems(iterspace)) = 896
```

Regardless of the (somehow) missing bounds check here, it seems very wasteful to launch (1, 2, 2) = 4 groups to cover (1, 1025, 2) items if we end up deriving an ND index from a linear one anyway (i.e., without having to cover each dimension individually)...
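A minimal sketch in plain Julia of why the per-dimension split yields the extra group; the `cld`-based rounding is an assumption inferred from the trace above, not KA's actual implementation:

```julia
ndrange       = (1, 1025, 2)
workgroupsize = (1, 896, 1)

# Partitioning each dimension independently rounds up per dimension:
groups = cld.(ndrange, workgroupsize)       # (1, 2, 2)
prod(groups) * prod(workgroupsize)          # 4 groups * 896 = 3584 workitems

# Flattening first and partitioning linearly rounds up only once:
cld(prod(ndrange), prod(workgroupsize))     # 3 groups * 896 = 2688 workitems
```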

@vchuravy I'll defer to you on this.
