Fix @simd for non 1 step CartesianPartition
#42736
Conversation
@nanosoldier
Looks like the tests mostly failed because OpenBLAS tries to start too many threads and exhausts kernel resources; that's unrelated to this PR.
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Benchmark code:
```julia
using BenchmarkTools

bench1(f::OP, a, b = ()) where OP = begin
    @inbounds @simd for i in eachindex(IndexCartesian(), a)
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
bench2(f::OP, a, b = ()) where OP = begin
    iter = view(eachindex(IndexCartesian(), a), 2:length(a) - 1)
    @inbounds @simd for i in iter
        a[i] = f(map(x -> (@inbounds x[i]), b)...)
    end
end
println("----------fill!------------")
N = 2^16
for n = 2:4
    a = zeros(N >> (3n - 3), ntuple(_ -> 8, n - 1)...)
    println("N = ", n)
    @btime bench1(Returns(1), $a)
    @btime bench2(Returns(1), $a)
end
println("---------- .* ------------")
N = 2^16
for n = 2:4
    a = zeros(N >> (3n - 3), ntuple(_ -> 8, n - 1)...)
    b = randn(size(a))
    c = randn(size(a))
    println("N = ", n)
    @btime bench1(*, $a, ($c, $b))
    @btime bench2(*, $a, ($c, $b))
end
```
Result:
```
----------fill!------------
N = 2
10.800 μs (0 allocations: 0 bytes)
11.100 μs (0 allocations: 0 bytes)
N = 3
10.700 μs (0 allocations: 0 bytes)
10.900 μs (0 allocations: 0 bytes)
N = 4
11.100 μs (0 allocations: 0 bytes)
10.800 μs (0 allocations: 0 bytes)
---------- .* ------------
N = 2
22.700 μs (0 allocations: 0 bytes)
22.800 μs (0 allocations: 0 bytes)
N = 3
22.700 μs (0 allocations: 0 bytes)
22.800 μs (0 allocations: 0 bytes)
N = 4
22.800 μs (0 allocations: 0 bytes)
23.300 μs (0 allocations: 0 bytes)
```
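For context, the `iter` built in `bench2` above is what Base calls a `CartesianPartition`: a contiguous `view` into the `CartesianIndices` of `a`. A minimal sketch of what it iterates (the variable names here are mine, not from the PR):

```julia
a = zeros(4, 4)
full = eachindex(IndexCartesian(), a)   # CartesianIndices((4, 4))
part = view(full, 2:length(a) - 1)      # drops the first and last index

# The partition visits the interior indices in memory (column-major) order:
@assert collect(part) == collect(full)[2:end - 1]
```

This is the iterator type whose `@simd` lowering the PR fixes for non-unit steps.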
@vtjnash I noticed that a few commits were pushed after you added the
But anyway, all CI is green (except for
@nanosoldier
@N5N3 Could you also add some benchmarks for this to BaseBenchmarks.jl?
I’d like to, but I’m quite unfamiliar with BaseBenchmarks.jl. |
Yes please. This is quite sensitive to future code changes or unrelated changes in part of the pipeline. |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. |
It seems the nanosoldier perf test also failed, whereas the merge-me label only requires that tests pass.
IIUC,
Anyway I agree that performance should be tested before we merge this PR. |
It takes about 30 ns to generate the 2d reshaped. By calling the 3-arg
Looks like all tests failed because of a network problem.
Just noticed that reused. On the other hand, I still think we'd better make
Force-pushed ba8e114 to 86a627a (compare)
Force-pushed 86a627a to 8fef013 (compare)
Force-pushed 8fef013 to 86d4375 (compare)
Bump?
revert changes in reshapedarray.jl
use Iterators.rest
move `@inbounds` outside the loop body. see JuliaLang#38086
`first(Base.axes1())` works well, but `length(::StepRange)` won't be omitted by LLVM if `step` is unknown at compile time.
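As an aside on that last note: `length(::StepRange)` involves a division by the step, which LLVM can only fold away when the step is a compile-time constant. A small illustration of my own, for intuition:

```julia
r = 1:2:10    # StepRange with step 2
# length must divide the span by the step:
@assert length(r) == div(last(r) - first(r), step(r)) + 1

u = 1:10      # whereas a UnitRange's length is just a subtraction:
@assert length(u) == last(u) - first(u) + 1
```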
Force-pushed 86d4375 to 90d0ea1 (compare)
Rebased to get another CI run.
Thanks for picking this up.
… of `ReshapedArray` (#43518)

This performance difference was found when working on #42736. Currently, our `ReshapedArray` uses a stride-based `MultiplicativeInverse` to speed up index transformation. For example, for `a::AbstractArray{T,3}` and `b = vec(a)`, the index transformation is equivalent to:

```julia
offset = i - 1                          # b[i]
d1, r1 = divrem(offset, stride(a, 3))   # stride(a, 3) = size(a, 1) * size(a, 2)
d2, r2 = divrem(r1, stride(a, 2))       # stride(a, 2) = size(a, 1)
CartesianIndex(r2 + 1, d2 + 1, d1 + 1)  # a has one-based axes
```

(Each `stride` is replaced with a `MultiplicativeInverse` to accelerate `divrem`.)

This PR replaces the above machinery with:

```julia
offset = i - 1
d1, r1 = divrem(offset, size(a, 1))
d2, r2 = divrem(d1, size(a, 2))
CartesianIndex(r1 + 1, r2 + 1, d2 + 1)
```

For random access they should have the same computational cost, but for sequential access, like `sum(b)`, the size-based transformation seems faster. To avoid a bottleneck from IO, use `reshape(::CartesianIndices, x...)` to benchmark:

```julia
f(x) = let r = 0
    for i in eachindex(x)
        @inbounds r |= +(x[i].I...)
    end
    r
end
a = CartesianIndices((99, 100, 101));
@btime f(vec($a));                   # 2.766 ms --> 2.591 ms
@btime f(reshape($a, 990, 1010));    # 3.412 ms --> 2.626 ms
@btime f(reshape($a, 33, 300, 101)); # 3.422 ms --> 2.342 ms
```

I haven't looked into the reason for this performance difference. Besides the acceleration, this also makes it possible to reuse the `MultiplicativeInverse` in some cases (like #42736), so I think it might be useful?

Co-authored-by: Andy Dienes <[email protected]>
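The size-based transformation in the commit message above can be checked directly against `CartesianIndices`; here is a small self-contained sketch (the function name is illustrative, not from the PR):

```julia
# Size-based linear-to-cartesian transformation for a 3d array,
# as described in the commit message above.
function lin2cart(a::AbstractArray{<:Any,3}, i::Int)
    offset = i - 1
    d1, r1 = divrem(offset, size(a, 1))
    d2, r2 = divrem(d1, size(a, 2))
    CartesianIndex(r1 + 1, r2 + 1, d2 + 1)
end

a = rand(3, 4, 5)
ci = CartesianIndices(a)
# Matches Julia's own linear-to-cartesian mapping for every index:
@assert all(lin2cart(a, i) == ci[i] for i in 1:length(a))
```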
The previous code assumes that the step of each axis is 1, which is no longer valid after #37829. This PR adds support for non-1-step CartesianPartition.

Before:
After:

Besides the above fix, this PR also simplifies the inner loop with a `Generator`-based outer range. It accelerates 3/4d CartesianPartition a little on my desktop. Some benchmarks:
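A hedged sketch of the case this PR fixes, assuming a Julia with #37829's stepped `CartesianIndices`: a partition taken from stepped indices must keep those steps when iterated under `@simd`.

```julia
# CartesianIndices with non-1 steps (possible since JuliaLang#37829):
inds = CartesianIndices((1:2:7, 1:3:10))   # axis steps are 2 and 3
part = view(inds, 2:length(inds) - 1)      # a CartesianPartition

function accum(p)
    s = 0
    @simd for I in p
        s += sum(I.I)   # sum of the coordinate tuple
    end
    s
end

# With the fix, @simd iteration agrees with plain collection:
@assert accum(part) == sum(sum(I.I) for I in collect(inds)[2:end - 1])
```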