Comparing the overhead of dot expressions in Julia

Junars · July 4, 2020, 08:29

Julia's dot expressions seem a bit puzzling to me, but they are quite interesting.

Take these two arrays:

a = [1, 2, 3]
b = [2, 3, 4]


# Both give the same result
julia> a + b
3-element Array{Int64,1}:
 3
 5
 7

julia> a .+ b
3-element Array{Int64,1}:
 3
 5
 7

# But the cost differs
julia> @time a + b
  0.000003 seconds (1 allocation: 112 bytes)
3-element Array{Int64,1}:
 3
 5
 7

julia> @time a .+ b
  0.000020 seconds (3 allocations: 160 bytes)
3-element Array{Int64,1}:
 3
 5
 7

# For something as simple as .+, the cost is actually higher than for plain a + b!
julia> @time a .+ b
  0.000011 seconds (3 allocations: 160 bytes)
3-element Array{Int64,1}:
 3
 5
 7

# Let's keep going
julia> @time a + 3b
  0.392247 seconds (157.90 k allocations: 7.896 MiB)
3-element Array{Int64,1}:
  7
 11
 15

# Good result; as expected, better than a + 3b!
julia> @time a .+ 3b
  0.000013 seconds (4 allocations: 272 bytes)
3-element Array{Int64,1}:
  7
 11
 15

# ???
julia> @time a .+ 3 .* b
  0.359544 seconds (205.21 k allocations: 10.400 MiB)
3-element Array{Int64,1}:
  7
 11
 15

# Just because the parentheses make it easier for you to read?!
julia> @time a .+ (3 .* b)
  0.000024 seconds (5 allocations: 208 bytes)
3-element Array{Int64,1}:
  7
 11
 15

# Actually, no: the first run is unavoidably slow, so run it again!
julia> @time a .+ 3 .* b
  0.000014 seconds (5 allocations: 208 bytes)
3-element Array{Int64,1}:
 17
 26
 35

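# Nice, this is the best option, same as the one above
# The time is slightly longer, but this is only the first run;
# over several runs it should settle at roughly 1/3 of the current value!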
julia> @time @. a + 3b
  0.000032 seconds (5 allocations: 208 bytes)
3-element Array{Int64,1}:
  7
 11
 15

Overall, using f.(x) seems highly recommended, but something like a .+ b is not?
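The first-run effect in the transcript above (about 0.36 s, then microseconds for the same expression) can be reproduced in isolation. The following is a minimal sketch, assuming the expression is wrapped in a hypothetical helper `scaled_sum` that is not part of the original post, so that compilation is triggered by its first call:

```julia
# Hypothetical helper, introduced only for illustration.
scaled_sum(a, b) = a .+ 3 .* b

a = [1, 2, 3]
b = [2, 3, 4]

# The first call includes JIT compilation of scaled_sum for these argument types.
first_run = @elapsed scaled_sum(a, b)
# The second call reuses the compiled code, so it measures execution only.
second_run = @elapsed scaled_sum(a, b)

println("first call:  ", first_run, " s")
println("second call: ", second_run, " s")
```

The first number should usually be far larger than the second, although the exact values depend on the machine and the Julia version.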

Junars · July 4, 2020, 08:41

julia> @time a .= a .+ b
  0.000009 seconds (2 allocations: 48 bytes)
3-element Array{Int64,1}:
  55
  83
 111

julia> @time a = a .+ b
  0.000010 seconds (3 allocations: 160 bytes)
3-element Array{Int64,1}:
  57
  86
 115

julia> @time a = a + b
  0.000004 seconds (1 allocation: 112 bytes)
3-element Array{Int64,1}:
  69
 104
 139

So my conclusion is? With two or more '.', recommended! With a single '.', emmm, to be determined...
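The `.=` and `=` forms timed in this post differ in more than speed: `a .= a .+ b` writes into the existing array, whereas `a = a .+ b` builds a new array and rebinds the name `a` to it. A small sketch of that difference, using a hypothetical second name `alias` for the same array:

```julia
a = [1, 2, 3]
b = [2, 3, 4]
alias = a              # hypothetical second name bound to the same array

a .= a .+ b            # in place: the existing array is overwritten
same_after_dot_assign = alias === a

a = a .+ b             # a new array is allocated and `a` is rebound to it
same_after_rebind = alias === a

println(same_after_dot_assign, " ", same_after_rebind)
println(alias, " ", a)
```

After the rebinding, `alias` still refers to the old array, which is why the in-place form can get by with less memory.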

Sukanka · July 4, 2020, 09:21

How about trying @benchmark from BenchmarkTools.jl?

Junars · July 4, 2020, 09:37

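I haven't used BenchmarkTools.jl yet, but the Performance Tips in the official Julia documentation do recommend dot-vectorized functions. It is only the a .+ b case that looks odd; I ran it many times in a row, so the measurement should be fine.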

johnnychen94 · July 4, 2020, 10:07

In fact, a+b is ultimately carried out through broadcast as well; the difference can be understood as a+b doing some extra checking.

julia> using BenchmarkTools

julia> a = rand(10000, 1000);

julia> b = rand(10000, 1000);

julia> @btime $a + $b;
  15.700 ms (2 allocations: 76.29 MiB)

julia> @btime $a .+ $b;
  15.484 ms (2 allocations: 76.29 MiB)

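The biggest differences between the vectorized and non-vectorized versions are:

The vectorized version can make the most of the CPU's SIMD capabilities to improve execution efficiency (if I understand correctly, this is essentially broadcast as well).
The vectorized version means that any intermediate variable also needs a vector of the same size to store it, which is why writing the vectorized version is not recommended in Julia.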

julia> function my_muladd(a, b, c)
       a + b .* c
       end
my_muladd (generic function with 1 method)

julia> @btime my_muladd(a, b, a); # b .* a allocates a large matrix as the intermediate result
  30.305 ms (4 allocations: 152.59 MiB)

julia> @btime my_muladd.(a, b, a); # b .* a yields only a scalar as the intermediate result
  14.351 ms (4 allocations: 76.29 MiB)
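The same effect can be obtained without broadcasting the function itself, by dotting every operator so that the whole expression fuses into one loop. The following is a rough sketch using only Base; `unfused`, `fused` and `compare_allocations` are hypothetical names, and the measurement is wrapped in a function so that global variables do not distort it:

```julia
# Hypothetical helpers for illustration; @noinline keeps each call separate.
@noinline unfused(a, b, c) = a + b .* c   # b .* c is materialized, then added to a
@noinline fused(a, b, c) = @. a + b * c   # expands to a .+ b .* c, a single fused loop

function compare_allocations(a, b, c)
    unfused_bytes = @allocated unfused(a, b, c)
    fused_bytes = @allocated fused(a, b, c)
    return unfused_bytes, fused_bytes
end

a = rand(1000); b = rand(1000); c = rand(1000)

compare_allocations(a, b, c)   # warm-up call so compilation is not measured
unfused_bytes, fused_bytes = compare_allocations(a, b, c)

println("unfused: ", unfused_bytes, " bytes, fused: ", fused_bytes, " bytes")
```

The unfused version should report roughly twice the bytes of the fused one, since it allocates both the temporary and the result.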

johnnychen94 · July 4, 2020, 10:21

GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language

If the expression you want to benchmark depends on external variables, you should use $ to “interpolate” them into the benchmark expression to avoid the problems of benchmarking with globals. Essentially, any interpolated variable $x or expression $(...) is “pre-computed” before benchmarking begins:

Junars · July 4, 2020, 11:16

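I didn't follow the dot after the function name in my_muladd.(a,b,a). Why does the intermediate become a scalar? I feel I haven't fully understood what the . means, so could you explain a bit more~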

johnnychen94 · July 4, 2020, 12:49

When you broadcast, the function is called on each element:

julia> function my_muladd(a, b, c)
       tmp = b .* c
       @info "Types:" a=typeof(a) tmp=typeof(tmp)
       a + tmp
       end
my_muladd (generic function with 1 method)

julia> a = fill(1, 2, 2);

julia> b = fill(2, 2, 2);

julia> c = fill(4, 2, 2);

julia> my_muladd(a, b, c)
┌ Info: Types:
│   a = Array{Int64,2}
└   tmp = Array{Int64,2}
2×2 Array{Int64,2}:
 9  9
 9  9

julia> my_muladd.(a, b, c)
┌ Info: Types:
│   a = Int64
└   tmp = Int64
┌ Info: Types:
│   a = Int64
└   tmp = Int64
┌ Info: Types:
│   a = Int64
└   tmp = Int64
┌ Info: Types:
│   a = Int64
└   tmp = Int64
2×2 Array{Int64,2}:
 9  9
 9  9
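Put differently, for arrays of the same shape the dotted call behaves like an explicit element-wise loop. A minimal sketch, assuming a hypothetical counter `CALLS` and a scalar-style variant `counted_muladd` that are not part of the original reply:

```julia
# Hypothetical call counter, used only to make the per-element calls visible.
const CALLS = Ref(0)

function counted_muladd(a, b, c)
    CALLS[] += 1
    return a + b * c
end

a = fill(1, 2, 2)
b = fill(2, 2, 2)
c = fill(4, 2, 2)

# Broadcasting calls counted_muladd once per element, with scalar arguments.
broadcasted = counted_muladd.(a, b, c)
calls_from_broadcast = CALLS[]

# The equivalent hand-written loop over matching indices.
looped = similar(a)
for i in eachindex(a, b, c)
    looped[i] = counted_muladd(a[i], b[i], c[i])
end

println("calls made by the broadcast: ", calls_from_broadcast)
```

The counter matches the four Info messages printed above: one call per element of the 2×2 arrays.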

Junars · July 4, 2020, 13:22

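Yes, yes. My brain short-circuited when I first read it: I thought my_muladd was being applied to the three arguments (a,b,a) one by one, as if it had map semantics. Sorry to have made you explain further, and thank you very much!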

shilu1984 · July 5, 2020, 00:11

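This test seems flawed. You need a c with the same shape as a and b, so that @. c = a+b works element-wise. Without the dot, a+b creates a temporary to hold the result, which is then assigned to the left-hand side of the equals sign.
If you simply compare a .+ b with a + b, the memory needed is about the same, because both have to produce a result array. a .+ b is a broadcast, while a + b calls the + function directly, whose internal implementation may well be a broadcast too.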
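The comparison described here could look roughly like the following. This is a sketch under stated assumptions rather than a rigorous benchmark: `add_into!`, `add_new` and `measure_allocations` are hypothetical helpers, and only Base's `@allocated` is used:

```julia
# Hypothetical helpers for illustration; @noinline keeps each call separate.
@noinline function add_into!(c, a, b)
    @. c = a + b          # expands to c .= a .+ b and writes into the existing c
    return c
end

@noinline add_new(a, b) = a + b   # allocates a fresh array for the result

function measure_allocations(c, a, b)
    in_place_bytes = @allocated add_into!(c, a, b)
    fresh_bytes = @allocated add_new(a, b)
    return in_place_bytes, fresh_bytes
end

a = rand(1000)
b = rand(1000)
c = similar(a)            # preallocated output with the same shape as a and b

measure_allocations(c, a, b)   # warm-up call so compilation is not measured
in_place_bytes, fresh_bytes = measure_allocations(c, a, b)

println("in place: ", in_place_bytes, " bytes, new array: ", fresh_bytes, " bytes")
```

With a preallocated `c`, the in-place version should report far fewer bytes than `a + b`, which has to allocate its result.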