broadcast是不是矢量化？

chenxl · 2018 年8 月 26 日 12:07

文档里说julia不希望程序写成矢量化，那broadcast例如@.好像可以提高效率，使用它好像有需要把代码写成矢量化了，岂不是矛盾了？

Gnimuc · 2018 年8 月 26 日 12:20

在0.6之前，因为没有简单的方法做到 loop fusion, 所以将程序写成 vectorized 会影响效率。现在使用 @. 可以实现 syntactic loop fusion 了，相对于之前裸 vectorize 是加速了。文档里说不希望写成 vectorized 是强调 for-loops 在 Julia 里原生就是很快的，不需要像 Matlab, Python 或 R 那样进行 vectorize 来提升效率，避开循环。

chenxl · 2018 年8 月 26 日 13:12

broadcast貌似比普通的for要快，是不是为了效率还是需要写成矢量化？

Gnimuc · 2018 年8 月 26 日 13:17

不会的，因为broadcast最后还是会被转译成for-loops。

Scheme · 2018 年9 月 13 日 18:27

github.com/JuliaLang/julia

Broadcasting is much slower than a for loop

已打开 08:13PM - 15 Jul 18 UTC

YingboMa

performance regression broadcast simd

Here is a minimal working example. ```julia julia> using BenchmarkTools j…ulia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T @. a = b + 0.1 * (0.2c + 0.3d + 0.4e) nothing end foo (generic function with 1 method) julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T @assert length(a) == length(b) == length(c) == length(d) == length(e) @inbounds for i in eachindex(a) a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i]) end nothing end goo (generic function with 1 method) julia> a,b,c,d,e=(rand(1000) for i in 1:5) Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##9#10"))}(getfield(Main, Symbol("##9#10"))(), 1:5) julia> @btime foo($a,$b,$c,$d,$e) 1.277 μs (0 allocations: 0 bytes) julia> @btime goo($a,$b,$c,$d,$e) 345.568 ns (0 allocations: 0 bytes) julia> versioninfo() Julia Version 0.7.0-beta2.12 Commit a878341 (2018-07-15 15:57 UTC) Platform Info: OS: Linux (x86_64-pc-linux-gnu) CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-6.0.0 (ORCJIT, skylake) Environment: JULIA_PKG3_PRECOMPILE = 1 ```

Scheme · 2018 年9 月 13 日 19:38

(我把每一段都写得比较短, 感觉这样更容易读. 我一般把多余的信息写在小括号里, 不知道这样好不好. 如果可能的话, 希望可以在下方留些建议.)

我个人感觉矢量化这个名词在高级语言中有很大的歧义 (见维基百科), 尤其在Julia语言.

在"底层"语言 (我不知道什么是底层的严格定义)中, 矢量化是指使用SIMD, 即为一个指令多个数据, 而且这"多个数据"是一个类型的, 就像矢量一样.

后来有了"高层"语言 (这个高层也不是严格定义的), 比如Fortran 90(很多人会说它底层, 所以之前说了不是严格定义)和MATLAB等语言, 矢量化一般是指在数组上操作. 比如说 C=A+B, 当A, B和C都是数组的时候 (比如矩阵. 题外话, 数值分析上, Housholder notation [很抱歉, 不知道中文对应翻译], 希腊小写字母是标量, 小写拉丁字母是矢量, 大写拉丁字母是矩阵).

从第一个意义上看, Julia的broadcast是矢量化代码, 因为如果数据在内存线性排列 (比如StridedArray 见下方 [1]), broadcast 可能会用SIMD. 从第二个定义上看, broadcast直接在数组上操作, 所以这么看, 也是矢量化.

但是!

在Julia文档上说不希望矢量化指的不是上面两种定义, 而是需要调用外部库的矢量化 (会产生没有必要的, 暂时的数组 (temporary array)), 如MATLAB或者Python的NumPy那样. 如果你在MATLAB里写 X = A+B+C+D, MATLAB运行的是类似于

temp1 = A + B
temp2 = temp1 + C
X = temp2 + D

的样子. 因为这里的 + 要调用别的语言写的, 优化的库 (我猜是BLAS的 daxpy, 不过我不知道MATLAB具体用什么). 如果用 for 循环写的话, 就不会有这些问题, 如

@assert ...
X = similar(D)
@inbounds for i in eachindex(X)
    X[i] = A[i] + B[i] + C[i] + D[i]
end

而且这样还有一个好处就是, 没有内存浪费 (不需要 temp), 而且 只需要一次检查 矩阵的大小就不用再判断数组访问是否合法了, 所以可以用 @inbounds 来避免没有必要的访问正确性检查. (当然, 实际上不是这么简单的 for 循环, 因为Julia要支持非1开始的数组, 如果X是1开始, 而A不是的话, 上面就会出错. 细节见这里.)

Julia的broadcast机制就是把 D = A .+ B .+ C .+ D 重新写成 for 循环. 所以说, 文档写了避免矢量化 (因为会造成没必要的内存浪费) 用broadcast.

总结

Julia的broadcast虽然看起来像MATLAB的风格的矢量化, 但从本质上不一样. Julia用的 for 循环, MATLAB的矢量化是调用别的库. 而且, Julia的broadcast十分灵活而且很容易拓展, 甚至可以放在GPU. 所以, 把broadcast想成一个对循环的抽象就好. 虽然感觉还有很多东西没有提到, 但是我感觉这么多也很够了. 是否要移到新的帖子上? @Gnimuc

脚注

julia> StridedArray
Union{DenseArray{T,N}, ReinterpretArray{T,N,S,A} where S where A<:Union{SubArray{T,N,A,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where A<:DenseArray where N where T, DenseArray}, ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{ReinterpretArray{T,N,S,A} where S where A<:Union{SubArray{T,N,A,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where A<:DenseArray where N where T, DenseArray} where N where T, SubArray{T,N,A,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where A<:DenseArray where N where T, DenseArray}, SubArray{T,N,A,I,L} where L where I<:Tuple{Vararg{Union{Int64, AbstractRange{Int64}, AbstractCartesianIndex},N} where N} where A<:Union{ReinterpretArray{T,N,S,A} where S where A<:Union{SubArray{T,N,A,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where A<:DenseArray where N where T, DenseArray} where N where T, ReshapedArray{T,N,A,MI} where MI<:Tuple{Vararg{SignedMultiplicativeInverse{Int64},N} where N} where A<:Union{ReinterpretArray{T,N,S,A} where S where A<:Union{SubArray{T,N,A,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where A<:DenseArray where N where T, DenseArray} where N where T, SubArray{T,N,A,I,true} where I<:Tuple{AbstractUnitRange,Vararg{Any,N} where N} where A<:DenseArray where N where T, DenseArray} where N where T, DenseArray}} where N where T

Gnimuc · 2018 年9 月 14 日 01:11

我把帖子移到性能优化的板块下了，要不给你版主权限方便操作帖子。

chenxl · 2018 年9 月 14 日 01:31

好的啊，是给我吗？

Scheme · 2018 年9 月 14 日 01:38

好啊, 不过版主权限是什么… emmm

Scheme · 2018 年9 月 14 日 01:41

希望之前写的解释不是非常的让人困惑. 现在懂了吗?