实现AbstractArray过程中遇到的性能问题

Jun · 2018 年9 月 4 日 04:53

最近需要实现一个CircularBuffer，嗯，不是DataStrucutres.jl中的那个CircularBuffer。抛开具体细节。目前遇到了两个疑惑：

在给Array的一个slice赋值的时候，因为提前知道了长度，似乎比(:)的语法更快？
测了几组数据，稳定的两倍关系

julia> xs = Array{Float64}(undef, (100, 100, 100, 100));

julia> using BenchmarkTools

julia> @benchmark xs[100*100*100*99 + 1 : 100*100*100*100] = $(randn(100, 100, 100))
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     478.000 μs (0.00% GC)
  median time:      562.000 μs (0.00% GC)
  mean time:        601.356 μs (0.00% GC)
  maximum time:     1.958 ms (0.00% GC)
  --------------
  samples:          8203
  evals/sample:     1

julia> @benchmark xs[:, :, :, 100]= $(randn(100, 100, 100))
BenchmarkTools.Trial:
  memory estimate:  864 bytes
  allocs estimate:  31
  --------------
  minimum time:     1.224 ms (0.00% GC)
  median time:      1.389 ms (0.00% GC)
  mean time:        1.416 ms (0.00% GC)
  maximum time:     3.056 ms (0.00% GC)
  --------------
  samples:          3494
  evals/sample:     1

看源码的实现似乎是因为基于Colon会多跑几次?

using BenchmarkTools
import Base: getindex, setindex!, size
struct A{T, N} <: AbstractArray{T, N}
    data::Array{T, N}
end
size(x::A) = size(x.data)
getindex(x::A{T, N}, I::Vararg{Int, N}) where {T,N} = getindex(x.data, I[1:N-1]..., 1)
a = A(Array{Float64}(undef, (100, 100, 100, 100)));
@benchmark a[:, :, :, 1]

BenchmarkTools.Trial:
  memory estimate:  221.25 MiB
  allocs estimate:  5000005
  --------------
  minimum time:     661.683 ms (2.35% GC)
  median time:      721.564 ms (2.20% GC)
  mean time:        728.419 ms (3.18% GC)
  maximum time:     814.892 ms (8.24% GC)
  --------------
  samples:          7
  evals/sample:     1

这里不太理解为什么会有那么多内存分配。上面getindex(x.data, I[1:N-1]..., 1)只是一个用来说明问题的例子，实际中1是计算出来的。下面是对比

julia> @benchmark a.data[:, :, :, 1]
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  5
  --------------
  minimum time:     1.225 ms (0.00% GC)
  median time:      1.472 ms (0.00% GC)
  mean time:        1.559 ms (6.20% GC)
  maximum time:     3.675 ms (0.00% GC)
  --------------
  samples:          3168
  evals/sample:     1

Gnimuc · 2018 年9 月 4 日 05:27

应该是splitting penalty，getindex(x::A{T, N}, I::Vararg{Int, N}) where {T,N} = getindex(x.data, I[1], I[2], I[3], 1)就不会allocate。

X-ref:

Jun · 2018 年9 月 4 日 06:00

多谢~

那我也明白了，为啥文档里关于IndexCartesia()强调了下面这点

IndexCartesian() arrays, on the other hand, require methods to be defined for each supported dimensionality with ndims(A) Int indices.

Gnimuc · 2018 年9 月 4 日 07:31

第一个问题用 in-place 的 broadcast 会快一些 xs[:, :, :, 100] .= randn(100, 100, 100) 但仍然有 1.5 倍差距。

但这个现象似乎取决于 Array 的大小：

julia> xs = zeros(10,10,10,10);

julia> A = randn(10,10,10);

julia> foo(x, A) = (x[9001:10000] = A)
foo (generic function with 1 method)

julia> @btime foo($xs, $A);
  148.062 ns (0 allocations: 0 bytes)

julia> hoo(x, A) = (x[:,:,:,10] .= A)
hoo (generic function with 1 method)

julia> @btime hoo($xs, $A);
  98.186 ns (1 allocation: 64 bytes)

julia> goo(x, A) = (x[:,:,:,10] = A)
goo (generic function with 1 method)

julia> @btime goo($xs, $A);
  1.095 μs (1 allocation: 48 bytes)

foo 会走 unsafe_copyto! 直接 memmove 一整块内存。goo 会走 _unsafe_setindex! 貌似就是普通的循环，而 hoo 可以正确生成向量化代码所以更快一些。

X-ref: Lack of vectorization when assigning view to slice of an array · Issue #28331 · JuliaLang/julia · GitHub