In-place array operations

Qling · March 30, 2021, 11:14

I defined two normalize functions: a non-in-place version and an in-place version.

normalize(x::AbstractMatrix) = let 
    min_value, max_value = minimum(x), maximum(x)
    return (x .- min_value) ./ (max_value - min_value)
end

normalize!(x::AbstractMatrix) = let
    min_value, max_value = minimum(x), maximum(x)

    @. x = (x - min_value) / (max_value - min_value)
end
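
To make the difference between the two versions concrete, here is a minimal sketch on a tiny matrix. The definitions are the same as above, only written in long form; the in-place one returns x explicitly, and the sample matrix a is made up for illustration.

```julia
# Non-in-place: builds and returns a new matrix, leaving the input untouched.
function normalize(x::AbstractMatrix)
    min_value, max_value = minimum(x), maximum(x)
    return (x .- min_value) ./ (max_value - min_value)
end

# In-place: overwrites the input and returns that same array.
function normalize!(x::AbstractMatrix)
    min_value, max_value = minimum(x), maximum(x)
    @. x = (x - min_value) / (max_value - min_value)
    return x
end

a = [1.0 2.0; 3.0 5.0]   # illustrative input
b = normalize(a)         # new matrix; `a` still holds the original values
c = normalize!(a)        # `a` is overwritten; `c` is the very same array as `a`

println(b)
println(c === a)
```

The non-in-place call pays for one extra array of the same size as the input; the in-place call reuses the input's memory.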

When I benchmarked them, the in-place version turned out to be no faster than the non-in-place one. Why is that?

img = randn(10_000, 10_000)

With the non-in-place version:

@benchmark normalize($img)

Result:

BenchmarkTools.Trial: 
  memory estimate:  762.94 MiB
  allocs estimate:  2
  --------------
  minimum time:     223.861 ms (0.36% GC)
  median time:      245.247 ms (7.16% GC)
  mean time:        246.892 ms (7.82% GC)
  maximum time:     314.127 ms (27.65% GC)
  --------------
  samples:          21
  evals/sample:     1

With the in-place version:

@benchmark normalize!($img)

Result:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     240.636 ms (0.00% GC)
  median time:      247.378 ms (0.00% GC)
  mean time:        246.763 ms (0.00% GC)
  maximum time:     259.142 ms (0.00% GC)
  --------------
  samples:          21
  evals/sample:     1

The in-place version does not seem to be any faster, and it is even slightly slower on minimum time and median time. Why?

Sukanka · March 30, 2021, 15:27

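I found that the following is actually faster, though I don't know why either. Waiting for an expert to explain (nice machine you have, by the way).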

julia> normalize2!(x::AbstractMatrix) = let
           min_value, max_value = minimum(x), maximum(x)
           for i in eachindex(x)
               x[i] = (x[i] - min_value) / (max_value - min_value)
           end
       end
normalize2! (generic function with 1 method)

julia> img = randn(10_000, 10_000)

julia> @benchmark normalize($img)
BenchmarkTools.Trial: 
  memory estimate:  762.94 MiB
  allocs estimate:  2
  --------------
  minimum time:     248.736 ms (0.00% GC)
  median time:      263.089 ms (0.00% GC)
  mean time:        274.860 ms (7.37% GC)
  maximum time:     362.598 ms (29.22% GC)
  --------------
  samples:          19
  evals/sample:     1

julia> @benchmark normalize!($img)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     253.117 ms (0.00% GC)
  median time:      254.231 ms (0.00% GC)
  mean time:        256.409 ms (0.00% GC)
  maximum time:     270.332 ms (0.00% GC)
  --------------
  samples:          20
  evals/sample:     1

julia> @benchmark normalize2!($img)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     192.581 ms (0.00% GC)
  median time:      207.008 ms (0.00% GC)
  mean time:        214.320 ms (0.00% GC)
  maximum time:     259.970 ms (0.00% GC)
  --------------
  samples:          24
  evals/sample:     1

Qling · March 30, 2021, 16:03

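I tried it too. The for loop does seem to be faster than the dot operations; my guess is that broadcasting carries some time overhead. (Yours isn't bad either, haha. AMD yes!)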

henry2004y · April 2, 2021, 19:11

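I have seen similar results in my earlier tests. My feeling is that, compared with an explicit loop, dot fusion needs some extra allocations for intermediate variables; not many, but enough to make it slightly slower.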
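
One way to test this hunch is to measure heap allocations directly with @allocated. The sketch below is only an illustration: fused! and looped! are hypothetical names, and the scalars are passed in as arguments to keep the comparison focused on the update step. Note that the trials posted in this thread already report 0 bytes for both in-place versions, so any overhead of the fused form is probably not a heap allocation.

```julia
# Fused broadcast update (the `@.` form).
function fused!(x, lo, hi)
    @. x = (x - lo) / (hi - lo)
    return x
end

# Explicit loop doing the same element-wise update.
function looped!(x, lo, hi)
    for i in eachindex(x)
        x[i] = (x[i] - lo) / (hi - lo)
    end
    return x
end

x = rand(100, 100)
lo, hi = extrema(x)

# Run each function once first so that compilation is not counted.
fused!(copy(x), lo, hi)
looped!(copy(x), lo, hi)

y = copy(x)
z = copy(x)
fused_bytes = @allocated fused!(y, lo, hi)
looped_bytes = @allocated looped!(z, lo, hi)

println("fused:  ", fused_bytes, " bytes")
println("looped: ", looped_bytes, " bytes")
```

If both numbers come out the same, the speed difference has to come from somewhere other than allocation, for example from the machine code generated for the two loops.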

Results on my laptop, Intel i7-10750H:

julia> @benchmark normalize($img)
BenchmarkTools.Trial:
  memory estimate:  762.94 MiB
  allocs estimate:  2
  --------------
  minimum time:     496.196 ms (0.06% GC)
  median time:      500.416 ms (0.06% GC)
  mean time:        553.995 ms (10.21% GC)
  maximum time:     671.142 ms (26.15% GC)
  --------------
  samples:          10
  evals/sample:     1

julia> img = randn(10_000, 10_000);

julia> @benchmark normalize!($img)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     454.142 ms (0.00% GC)
  median time:      454.749 ms (0.00% GC)
  mean time:        454.869 ms (0.00% GC)
  maximum time:     455.670 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1
julia> img = randn(10_000, 10_000);

julia> @benchmark normalize2!($img)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     371.282 ms (0.00% GC)
  median time:      371.868 ms (0.00% GC)
  mean time:        372.023 ms (0.00% GC)
  maximum time:     373.091 ms (0.00% GC)
  --------------
  samples:          14
  evals/sample:     1

In-place is slightly faster than non-in-place. The ratio is similar for larger data as well.

Qling · April 3, 2021, 04:26

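OK, so non-in-place really is slightly faster than in-place. I don't know why, but it still feels a bit magical. I had assumed that allocating new memory in the non-in-place version would cost more time.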

henry2004y · April 3, 2021, 06:18

Sorry, I wrote it backwards earlier. Not allocating new memory is faster; see the numbers I posted. Maybe garbage collection is the main factor?

Qling · April 3, 2021, 06:45

Oh, sorry, I misread it too, haha. But why do our measurements come out differently? That's a bit odd.

songxianxu · April 3, 2021, 17:09

My tests:

In-place is faster than non-in-place.
A hand-written for loop is faster than broadcasting (e.g. a dot-syntax loop, or map).

Some things I collected earlier:

See this reply by Steven G. Johnson (not exactly the same question, but @. x = (x - min_value) / (max_value - min_value) seems to have a similar issue): Operating on many arrays of the same size: one loop, many loops, or broadcasting? - #2 by stevengj - Performance - Julia Programming Language
Another comparison I collected earlier (I think I was testing map performance at the time and found that map may not be a good fit when the workload gets large...): Benchmarking maps, loops, generators and broadcasting in Julia | Dean Markwick
An earlier introduction: More Dots: Syntactic Loop Fusion in Julia (though I never finished reading it...)
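
For completeness, the map-based form mentioned above could look like the sketch below. normalize_map! is a hypothetical name that does not appear elsewhere in this thread, and no timing claim is made for it; it only shows that map! can write the result back into the same array.

```julia
# In-place normalization via map!, using `x` as both destination and source.
function normalize_map!(x::AbstractMatrix)
    lo, hi = extrema(x)
    map!(v -> (v - lo) / (hi - lo), x, x)
    return x
end

m = [1.0 2.0; 3.0 5.0]   # illustrative input
normalize_map!(m)
println(m)
```

It can be benchmarked against the loop and broadcast versions in the same way as the other functions in this thread.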

johnnychen94 · April 4, 2021, 08:03

My guess is that when the data is too large it thrashes the CPU's L3 cache, which degrades performance:

function normalize_fuse(x::AbstractMatrix)
    min_value, max_value = extrema(x)
    @. (x - min_value) / (max_value - min_value)
end

function normalize_for!(x::AbstractMatrix)
    min_value, max_value = extrema(x)
    @inbounds @simd for i in eachindex(x)
        x[i]= (x[i] - min_value) / (max_value - min_value)
    end
    return x
end

function normalize_fuse!(x::AbstractMatrix)
    min_value, max_value = extrema(x)
    @. x = (x - min_value) / (max_value - min_value)
end

# Precompute some intermediate results
function normalize_for2!(x::AbstractMatrix)
    min_value, max_value = extrema(x)
    tmp1 = 1/(max_value - min_value)
    tmp2 = -min_value * tmp1
    @inbounds @simd for i in eachindex(x)
        x[i] = x[i] * tmp1 + tmp2
    end
    return x
end

function normalize_fuse2!(x::AbstractMatrix)
    min_value, max_value = extrema(x)
    tmp1 = 1/(max_value - min_value)
    tmp2 = -min_value * tmp1
    @. x = x * tmp1 + tmp2
end
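
The "2" variants replace the per-element division with a precomputed scale and shift, so their results can differ from the division form in the last bits. A minimal sketch to check that the two forms agree up to rounding; by_division! and by_scale_shift! are hypothetical names used here only to keep the example self-contained:

```julia
# Division form: (x - lo) / (hi - lo) for every element.
function by_division!(x::AbstractMatrix)
    lo, hi = extrema(x)
    @. x = (x - lo) / (hi - lo)
    return x
end

# Scale-and-shift form: one multiply and one add per element.
function by_scale_shift!(x::AbstractMatrix)
    lo, hi = extrema(x)
    scale = 1 / (hi - lo)
    shift = -lo * scale
    @. x = x * scale + shift
    return x
end

img = rand(100, 100)
a = by_division!(copy(img))
b = by_scale_shift!(copy(img))

println(a ≈ b)                    # agreement up to floating-point rounding
println(maximum(abs.(a .- b)))    # largest element-wise difference
```

The difference is harmless for image normalization, but the scale-and-shift form is not guaranteed to map the maximum to exactly 1.0.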

On small data they all perform essentially the same, probably because CPU cache effectiveness is not disrupted:

img = rand(100, 100);
@btime normalize_fuse($img); # 28.268 μs (2 allocations: 78.20 KiB)
@btime normalize_for!($img); # 28.058 μs (0 allocations: 0 bytes)
@btime normalize_fuse!($img); # 32.686 μs (0 allocations: 0 bytes)
@btime normalize_for2!($img); # 25.491 μs (0 allocations: 0 bytes)
@btime normalize_fuse2!($img); # 27.549 μs (0 allocations: 0 bytes)

After scaling the data size up by a factor of 10,000, the performance gap actually widens:

img = rand(10_000, 10_000);
@btime normalize_fuse($img); # 650.347 ms (2 allocations: 762.94 MiB)
@btime normalize_for!($img); # 345.016 ms (0 allocations: 0 bytes)
@btime normalize_fuse!($img); # 404.579 ms (0 allocations: 0 bytes)
@btime normalize_for2!($img); # 341.994 ms (0 allocations: 0 bytes)
@btime normalize_fuse2!($img); # 356.055 ms (0 allocations: 0 bytes)

But normalize_fuse2! still comes fairly close, probably because that version involves fewer external quantities and is therefore less affected by cache misses. (This explanation may be wrong.)

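A more likely reason is that we explicitly give the compiler more information, so it can optimize more effectively to cope with cache misses. The for-loop version in effect puts more emphasis on data locality in the code logic itself, which directly lowers the chance of cache misses.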

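As for why normalize_fuse! is slower than normalize_fuse on small data... probably because allocating one small contiguous block of memory costs less than writing to memory many times. The x .= ... operation involves many memory writes, so it carries some extra overhead of its own. Memory writes themselves can also disrupt cache effectiveness.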
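
For a sense of scale: a 10_000 × 10_000 Float64 matrix is 762.94 MiB, which matches the allocation reported for the non-in-place versions and is far larger than typical CPU caches. The sketch below computes that size and times the extrema pass and the update pass separately. split_timing! is a hypothetical helper, not something from this thread, and the timings it reports are machine-dependent.

```julia
# Size of the benchmark array used in this thread.
n = 10_000
bytes = n * n * sizeof(Float64)
println("array size: ", round(bytes / 2^20; digits = 2), " MiB")

# Hypothetical helper: time the reduction pass and the update pass separately.
function split_timing!(x::AbstractMatrix)
    t0 = time_ns()
    lo, hi = extrema(x)                 # one full read of the array
    t1 = time_ns()
    @. x = (x - lo) / (hi - lo)         # one full read plus one full write
    t2 = time_ns()
    return (reduce = (t1 - t0) / 1e9, update = (t2 - t1) / 1e9)
end

small = rand(1_000, 1_000)
split_timing!(copy(small))              # warm-up call so compilation is excluded
t = split_timing!(small)
println("extrema pass: ", t.reduce, " s, update pass: ", t.update, " s")
```

Every version benchmarked here pays for the reduction pass, so only the update pass can differ between the in-place and non-in-place variants.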

Qling · April 4, 2021, 08:32

Many thanks to everyone for the answers!! Very useful, I learned a lot.