Try fast linear indexes for KA #2612
base: master
Will have to wait for #2593 to get merged.
CUDA.jl Benchmarks
| Benchmark suite | Current: c23ab7f | Previous: 14ae82d | Ratio |
|---|---|---|---|
| latency/precompile | 45424277593.5 ns | 45345622059 ns | 1.00 |
| latency/ttfp | 6384836106.5 ns | 6434638936 ns | 0.99 |
| latency/import | 3034614553 ns | 3051828695.5 ns | 0.99 |
| integration/volumerhs | 9567744 ns | 9568259 ns | 1.00 |
| integration/byval/slices=1 | 146777 ns | 146590 ns | 1.00 |
| integration/byval/slices=3 | 425370 ns | 425823 ns | 1.00 |
| integration/byval/reference | 144803 ns | 144766 ns | 1.00 |
| integration/byval/slices=2 | 286177 ns | 286423 ns | 1.00 |
| integration/cudadevrt | 103434 ns | 103488.5 ns | 1.00 |
| kernel/indexing | 13938 ns | 14282.5 ns | 0.98 |
| kernel/indexing_checked | 15060 ns | 15333 ns | 0.98 |
| kernel/occupancy | 701.1063829787234 ns | 720.4492753623189 ns | 0.97 |
| kernel/launch | 2124.1111111111113 ns | 2130.5 ns | 1.00 |
| kernel/rand | 16334 ns | 17397 ns | 0.94 |
| array/reverse/1d | 19520 ns | 19471 ns | 1.00 |
| array/reverse/2d | 24603 ns | 24536 ns | 1.00 |
| array/reverse/1d_inplace | 10031.666666666666 ns | 10836.333333333334 ns | 0.93 |
| array/reverse/2d_inplace | 11528 ns | 11284 ns | 1.02 |
| array/copy | 20270 ns | 20310 ns | 1.00 |
| array/iteration/findall/int | 159097 ns | 158042 ns | 1.01 |
| array/iteration/findall/bool | 139369 ns | 138224 ns | 1.01 |
| array/iteration/findfirst/int | 153853 ns | 154038.5 ns | 1.00 |
| array/iteration/findfirst/bool | 154627.5 ns | 155126 ns | 1.00 |
| array/iteration/scalar | 75657 ns | 76714 ns | 0.99 |
| array/iteration/logical | 207799 ns | 214056.5 ns | 0.97 |
| array/iteration/findmin/1d | 41128 ns | 41628 ns | 0.99 |
| array/iteration/findmin/2d | 94766 ns | 94463 ns | 1.00 |
| array/reductions/reduce/1d | 38659 ns | 51305 ns | 0.75 |
| array/reductions/reduce/2d | 44155.5 ns | 42302 ns | 1.04 |
| array/reductions/mapreduce/1d | 37246.5 ns | 44898.5 ns | 0.83 |
| array/reductions/mapreduce/2d | 51913.5 ns | 52966.5 ns | 0.98 |
| array/broadcast | 21698 ns | 21607 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11663 ns | 13399 ns | 0.87 |
| array/copyto!/cpu_to_gpu | 213248 ns | 213579.5 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 246883 ns | 245985.5 ns | 1.00 |
| array/accumulate/1d | 108538 ns | 109003 ns | 1.00 |
| array/accumulate/2d | 79961 ns | 79807 ns | 1.00 |
| array/construct | 1197.35 ns | 1147.9 ns | 1.04 |
| array/random/randn/Float32 | 43009 ns | 43138 ns | 1.00 |
| array/random/randn!/Float32 | 26240 ns | 26215 ns | 1.00 |
| array/random/rand!/Int64 | 27084 ns | 27096 ns | 1.00 |
| array/random/rand!/Float32 | 8824.666666666666 ns | 8869.333333333334 ns | 0.99 |
| array/random/rand/Int64 | 29684 ns | 29884 ns | 0.99 |
| array/random/rand/Float32 | 12772 ns | 12925 ns | 0.99 |
| array/permutedims/4d | 65015 ns | 67255 ns | 0.97 |
| array/permutedims/2d | 56278 ns | 56783 ns | 0.99 |
| array/permutedims/3d | 60503.5 ns | 58969.5 ns | 1.03 |
| array/sorting/1d | 2920400.5 ns | 2933376.5 ns | 1.00 |
| array/sorting/by | 3499981 ns | 3499572.5 ns | 1.00 |
| array/sorting/2d | 1084450 ns | 1084491.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 1027.8 ns | 1039.3 ns | 0.99 |
| cuda/synchronization/stream/nonblocking | 6532.2 ns | 6569.6 ns | 0.99 |
| cuda/synchronization/stream/blocking | 804.8350515463917 ns | 796.7647058823529 ns | 1.01 |
| cuda/synchronization/context/auto | 1166.8 ns | 1224.5 ns | 0.95 |
| cuda/synchronization/context/nonblocking | 6741.4 ns | 6745.4 ns | 1.00 |
| cuda/synchronization/context/blocking | 891.9583333333334 ns | 915.2391304347826 ns | 0.97 |
This comment was automatically generated by workflow using github-action-benchmark.
c23ab7f to 2a2b844
A handful of tests fail on this PR:
Not a catastrophic amount though, so probably worth looking into?
Well, not so sure about that
MWE for the bounds error:

```julia
using CUDA, KernelAbstractions

function main()
    A = CuArray{Float64}(undef, (1, 1025, 2))

    @kernel function fill_kernel!(a)
        idx = @index(Global, Linear)
        if idx > length(a)
            # Out of bounds: only report the first offending index to limit output.
            if idx == length(a) + 1
                @cushow threadIdx().x blockDim().x blockIdx().x gridDim().x idx
            end
        else
            a[idx] = 0f0
        end
    end

    kernel = fill_kernel!(get_backend(A))
    CUDA.@sync kernel(A; ndrange = size(A))
end
```

The linear index here goes out of bounds for a lot of threads, so I limited the printing to roughly the first one:
The launch configuration is strange: 4 blocks of 896 threads cover 3584 items, while 3 blocks would have been sufficient, covering 2688 items for the 2050 elements, no? In any case, it's also strange that this isn't detected by the bounds check I presume
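The block arithmetic in that comment can be checked with a few lines of plain Julia. This is only a sketch: the names `n`, `threads`, `blocks_needed` and `covered` are illustrative, and nothing here touches the GPU or KernelAbstractions.

```julia
# Array shape from the MWE and the block size from the observed launch.
dims = (1, 1025, 2)
n = prod(dims)                    # total number of elements
threads = 896                     # threads per block

# Minimum number of blocks for a flat, one-dimensional launch.
blocks_needed = cld(n, threads)

# Number of threads started by a launch with `blocks` blocks.
covered(blocks) = blocks * threads

println("elements:       ", n)
println("blocks needed:  ", blocks_needed)
println("3 blocks cover: ", covered(3), " (", covered(3) - n, " idle threads)")
println("4 blocks cover: ", covered(4), " (", covered(4) - n, " idle threads)")
```

Every thread beyond `n` computes a linear index that is past the end of the array, which is why the kernel needs its own guard unless the framework masks those threads.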
This is how KA's launch configuration determines that:
Regardless of the (somehow) missing bounds check here, it seems very wasteful to launch @vchuravy I'll defer to you on this.
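One way to end up with 4 blocks rather than 3 is to split the work per dimension instead of over the flat length. The sketch below assumes a workgroup size of `(1, 896, 1)`, which is a guess consistent with the numbers reported in this thread rather than something confirmed here. `greedy_workgroupsize` and `blocks_per_dim` are hypothetical helpers written for illustration, not KernelAbstractions or CUDA.jl API.

```julia
# Hypothetical helper: spread a thread budget over the dimensions of `ndrange`,
# filling the leading dimensions first (an assumed strategy, not the library's).
function greedy_workgroupsize(threads::Int, ndrange::Dims)
    sizes = Int[]
    remaining = threads
    for extent in ndrange
        s = min(remaining, extent)
        push!(sizes, s)
        remaining = max(1, remaining ÷ s)
    end
    return Tuple(sizes)
end

# Hypothetical helper: round up the number of blocks separately per dimension.
blocks_per_dim(ndrange, workgroupsize) = cld.(ndrange, workgroupsize)

ndrange = (1, 1025, 2)
wg = greedy_workgroupsize(896, ndrange)
blocks = blocks_per_dim(ndrange, wg)

println("workgroup size:   ", wg)
println("blocks per dim:   ", blocks)
println("total blocks:     ", prod(blocks))
println("threads launched: ", prod(blocks) * prod(wg))
println("flat minimum:     ", cld(prod(ndrange), prod(wg)), " blocks")
```

Under these assumptions the rounding happens once per dimension, so the second dimension alone already needs two blocks and the third doubles that, whereas a flat partition over `prod(ndrange)` rounds up only once.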
I am unsure why we couldn't do that in the beginning.