Try fast linear indexes for KA #2612

vchuravy · 2025-01-09T12:49:35Z

I am unsure why we couldn't do that in the beginning.

maleadt · 2025-01-09T13:12:20Z

Will have to wait for #2593 to get merged.

github-actions

CUDA.jl Benchmarks

Benchmark suite	Current: `c23ab7f`	Previous: `14ae82d`	Ratio
`latency/precompile`	`45424277593.5` ns	`45345622059` ns	`1.00`
`latency/ttfp`	`6384836106.5` ns	`6434638936` ns	`0.99`
`latency/import`	`3034614553` ns	`3051828695.5` ns	`0.99`
`integration/volumerhs`	`9567744` ns	`9568259` ns	`1.00`
`integration/byval/slices=1`	`146777` ns	`146590` ns	`1.00`
`integration/byval/slices=3`	`425370` ns	`425823` ns	`1.00`
`integration/byval/reference`	`144803` ns	`144766` ns	`1.00`
`integration/byval/slices=2`	`286177` ns	`286423` ns	`1.00`
`integration/cudadevrt`	`103434` ns	`103488.5` ns	`1.00`
`kernel/indexing`	`13938` ns	`14282.5` ns	`0.98`
`kernel/indexing_checked`	`15060` ns	`15333` ns	`0.98`
`kernel/occupancy`	`701.1063829787234` ns	`720.4492753623189` ns	`0.97`
`kernel/launch`	`2124.1111111111113` ns	`2130.5` ns	`1.00`
`kernel/rand`	`16334` ns	`17397` ns	`0.94`
`array/reverse/1d`	`19520` ns	`19471` ns	`1.00`
`array/reverse/2d`	`24603` ns	`24536` ns	`1.00`
`array/reverse/1d_inplace`	`10031.666666666666` ns	`10836.333333333334` ns	`0.93`
`array/reverse/2d_inplace`	`11528` ns	`11284` ns	`1.02`
`array/copy`	`20270` ns	`20310` ns	`1.00`
`array/iteration/findall/int`	`159097` ns	`158042` ns	`1.01`
`array/iteration/findall/bool`	`139369` ns	`138224` ns	`1.01`
`array/iteration/findfirst/int`	`153853` ns	`154038.5` ns	`1.00`
`array/iteration/findfirst/bool`	`154627.5` ns	`155126` ns	`1.00`
`array/iteration/scalar`	`75657` ns	`76714` ns	`0.99`
`array/iteration/logical`	`207799` ns	`214056.5` ns	`0.97`
`array/iteration/findmin/1d`	`41128` ns	`41628` ns	`0.99`
`array/iteration/findmin/2d`	`94766` ns	`94463` ns	`1.00`
`array/reductions/reduce/1d`	`38659` ns	`51305` ns	`0.75`
`array/reductions/reduce/2d`	`44155.5` ns	`42302` ns	`1.04`
`array/reductions/mapreduce/1d`	`37246.5` ns	`44898.5` ns	`0.83`
`array/reductions/mapreduce/2d`	`51913.5` ns	`52966.5` ns	`0.98`
`array/broadcast`	`21698` ns	`21607` ns	`1.00`
`array/copyto!/gpu_to_gpu`	`11663` ns	`13399` ns	`0.87`
`array/copyto!/cpu_to_gpu`	`213248` ns	`213579.5` ns	`1.00`
`array/copyto!/gpu_to_cpu`	`246883` ns	`245985.5` ns	`1.00`
`array/accumulate/1d`	`108538` ns	`109003` ns	`1.00`
`array/accumulate/2d`	`79961` ns	`79807` ns	`1.00`
`array/construct`	`1197.35` ns	`1147.9` ns	`1.04`
`array/random/randn/Float32`	`43009` ns	`43138` ns	`1.00`
`array/random/randn!/Float32`	`26240` ns	`26215` ns	`1.00`
`array/random/rand!/Int64`	`27084` ns	`27096` ns	`1.00`
`array/random/rand!/Float32`	`8824.666666666666` ns	`8869.333333333334` ns	`0.99`
`array/random/rand/Int64`	`29684` ns	`29884` ns	`0.99`
`array/random/rand/Float32`	`12772` ns	`12925` ns	`0.99`
`array/permutedims/4d`	`65015` ns	`67255` ns	`0.97`
`array/permutedims/2d`	`56278` ns	`56783` ns	`0.99`
`array/permutedims/3d`	`60503.5` ns	`58969.5` ns	`1.03`
`array/sorting/1d`	`2920400.5` ns	`2933376.5` ns	`1.00`
`array/sorting/by`	`3499981` ns	`3499572.5` ns	`1.00`
`array/sorting/2d`	`1084450` ns	`1084491.5` ns	`1.00`
`cuda/synchronization/stream/auto`	`1027.8` ns	`1039.3` ns	`0.99`
`cuda/synchronization/stream/nonblocking`	`6532.2` ns	`6569.6` ns	`0.99`
`cuda/synchronization/stream/blocking`	`804.8350515463917` ns	`796.7647058823529` ns	`1.01`
`cuda/synchronization/context/auto`	`1166.8` ns	`1224.5` ns	`0.95`
`cuda/synchronization/context/nonblocking`	`6741.4` ns	`6745.4` ns	`1.00`
`cuda/synchronization/context/blocking`	`891.9583333333334` ns	`915.2391304347826` ns	`0.97`

This comment was automatically generated by workflow using github-action-benchmark.

maleadt · 2025-01-09T17:26:52Z

A handful of tests fail on this PR:

julia> CUDA.@sync sum!(CUDA.rand(Float64, (1, 1025, 2)), CUDA.rand(Float64, (1, 1025, 2)))
ERROR: a BoundsError was thrown during kernel execution on thread (513, 1, 1) in block (3, 1, 1).
Out-of-bounds array access
Stacktrace:
 [1] throw_boundserror at /home/tim/Julia/pkg/CUDA/src/device/quirks.jl:15
 [2] multiple call sites at unknown:0

ERROR: KernelException: exception thrown during kernel execution on device NVIDIA RTX 6000 Ada Generation
Stacktrace:
 [1] check_exceptions()
   @ CUDA ~/Julia/pkg/CUDA/src/compiler/exceptions.jl:39
 [2] synchronize(stream::CuStream; blocking::Bool, spin::Bool)
   @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/synchronization.jl:207
 [3] synchronize
   @ ~/Julia/pkg/CUDA/lib/cudadrv/synchronization.jl:194 [inlined]
 [4] macro expansion
   @ ~/Julia/pkg/CUDA/src/utilities.jl:36 [inlined]
 [5] top-level scope
   @ REPL[44]:1

Not a catastrophic amount though, so probably worth looking into?

Test Summary:                                  |  Pass  Error  Broken  Total  Time
  Overall                                      | 25854     15      11  25880

maleadt · 2025-01-09T17:56:58Z

> Not [...] catastrophic though

Well, not so sure about that

julia> CUDA.@sync CUDA.ones(Float64, (1, 1025, 2));
ERROR: a BoundsError was thrown during kernel execution on thread (353, 1, 1) in block (4, 1, 1).
Out-of-bounds array access
Stacktrace:
 [1] throw_boundserror at /home/tim/Julia/pkg/CUDA/src/device/quirks.jl:15
 [2] multiple call sites at unknown:0

maleadt · 2025-01-09T18:12:23Z

MWE for the bounds error:

using CUDA, KernelAbstractions

function main()
    A = CuArray{Float64}(undef, (1, 1025, 2))

    @kernel function fill_kernel!(a)
        idx = @index(Global, Linear)
        # Indices are 1-based, so idx == length(a) is still a valid element.
        if idx > length(a)
            # Out of bounds: only report the first offending index.
            if idx == length(a)+1
                @cushow threadIdx().x blockDim().x blockIdx().x gridDim().x idx
            end
        else
            a[idx] = 0f0
        end
    end

    kernel = fill_kernel!(get_backend(A))
    CUDA.@sync kernel(A; ndrange = size(A))
end

The linear index here goes out of bounds for a lot of threads, so I limited the printing to the first one:

(threadIdx()).x = 259
(blockDim()).x = 896
(blockIdx()).x = 3
(gridDim()).x = 4
idx = 2051

The launch configuration is strange: 4 blocks of 896 threads cover 3584 items, while 3 blocks would have been sufficient, covering 2688 items for the 2050 needed, no? In any case, it's also strange that this isn't caught by the bounds check I presume @index is doing...
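
The printed values are consistent with a flat global index of the form `(blockIdx - 1) * blockDim + threadIdx`. Below is a minimal CPU-side sketch in plain Julia that assumes this formula; the thread does not show how `@index(Global, Linear)` is actually lowered on this branch, and the helper name `first_oob` is hypothetical.

```julia
# Hypothetical helper: emulate the flat global index on the CPU and return
# the first (block, thread, idx) whose index exceeds `len`.
# Assumption: idx = (block - 1) * blockdim + thread, with 1-based ids.
function first_oob(len, blockdim, griddim)
    for block in 1:griddim, thread in 1:blockdim
        idx = (block - 1) * blockdim + thread
        idx > len && return (; block, thread, idx)
    end
    return nothing  # every launched thread maps to a valid index
end

len = prod((1, 1025, 2))       # number of elements in the array
hit = first_oob(len, 896, 4)   # launch configuration from the printout
println(hit)
```

Under these assumptions the first out-of-range index is 2051, reached by thread 259 of block 3, which matches the values printed above.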

maleadt · 2025-01-09T18:21:37Z

> The launch configuration is strange: 4 blocks of 896 threads cover 3584 items, while 3 blocks would have been sufficient, covering 2688 items for the 2050 needed, no?

This is how KA's launch configuration determines that:

ndrange = (1, 1025, 2)
config.threads = 896
workgroupsize = threads_to_workgroupsize(threads, ndrange) = (1, 896, 1)
KA.blocks(iterspace) = CartesianIndices((1, 2, 2))
KA.workitems(iterspace) = CartesianIndices((1, 896, 1))
blocks = length(KA.blocks(iterspace)) = 4
threads = length(KA.workitems(iterspace)) = 896

Regardless of the (somehow) missing bounds check here, it seems very wasteful to launch (1, 2, 2) = 4 groups to cover (1, 1025, 2) items if we end up generating an ND index from a linear one anyway (i.e., without having to cover each dimension individually)...
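
To put numbers on the overhead, here is a small sketch in plain Julia comparing the per-dimension grid from the trace above with a grid sized from the total item count. It only redoes the arithmetic with `cld` and does not call KernelAbstractions; the variable names are illustrative.

```julia
ndrange       = (1, 1025, 2)
workgroupsize = (1, 896, 1)
groupsize     = prod(workgroupsize)            # threads per workgroup

# Per-dimension covering: round each dimension up separately.
blocks_nd   = cld.(ndrange, workgroupsize)     # one block count per dimension
nblocks_nd  = prod(blocks_nd)
launched_nd = nblocks_nd * groupsize

# Linear covering: round the total item count up once.
nitems       = prod(ndrange)
nblocks_lin  = cld(nitems, groupsize)
launched_lin = nblocks_lin * groupsize

println("per-dimension: ", nblocks_nd, " groups, ", launched_nd - nitems, " surplus threads")
println("linear:        ", nblocks_lin, " groups, ", launched_lin - nitems, " surplus threads")
```

For this shape the per-dimension grid launches 4 groups (3584 threads, 1534 of them surplus), whereas sizing the grid from the flat length needs 3 groups (2688 threads, 638 surplus).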

@vchuravy I'll defer to you on this.

github-actions bot reviewed Jan 9, 2025

Try fast linear indexes for KA

2a2b844

maleadt force-pushed the vc/fast_linear_global branch from c23ab7f to 2a2b844 Compare January 9, 2025 15:35
