Julia编译的时候优化会进行到哪一步？

XuJingye · 2023 年3 月 14 日 17:56

承接之前的一次讨论：
合理解读效率测试结果 - 综合讨论区 / 心得体会 - Julia中文社区 (juliacn.com)

这个简单的例子中，中英文论坛都有人提到了，返回的是nothing，所以程序优化之后，什么都不会做。

然而我发现这个优化似乎比较简单。即使我在程序中假如一个无聊的Vector，假装最后会把循环的结果会赋值给这个Vector。即使这个Vector并不输出，它也会认真循环。或者我在循环前面，对某个传入的Vector进行修改，后面的循环也会认真计算。

using BenchmarkTools

tmp_Vector = zeros(2)

function main1(v) # 矢量无效传入，无效循环。for循环确实没有计算。1.183ns
    s = 0
    for i in 1:100
        s += i
    end
end

function main2!(v) # 矢量有效传入，无效循环。但是循环居然也做了计算。14.261 ns
    v[1] = 1
    s = 0
    for i in 1:100
        s += i
    end
    nothing
end

function main3!(v) # 这个是为了和上面对比时间，循环真的做了计算。14.560 ns
    v[1] = 1
    s = 0
    for i in 1:100
        s += i
    end
    v[2] = s
    nothing
end


# @benchmark main1(tmp_Vector)
# @benchmark main2!(tmp_Vector)
@benchmark main3!(tmp_Vector)

我之前以为Julia在编译程序时，对程序的优化非常牛。可以省去一些计算步骤的设计。比如矢量*100个标量和100个标量*矢量这种计算顺序的优化，我以为它可以的。然而刚刚试了下，

using BenchmarkTools

v = zeros(10000)

function main1(v)
    v * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10
end

function main2(v)
    2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10 * v
end

@benchmark main1($v)
@benchmark main2($v)

结果其实根本没有。

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   59.600 μs …   5.741 ms  ┊ GC (min … max):  0.00% … 70.53%
 Time  (median):     208.700 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   220.150 μs ± 333.237 μs  ┊ GC (mean ± σ):  16.41% ± 10.76%

  ▆  ▁█
  █▄▃██▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂ ▂
  59.6 μs          Histogram: frequency by time         2.35 ms <

 Memory estimate: 703.55 KiB, allocs estimate: 18.

BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   6.167 μs …  1.597 ms  ┊ GC (min … max):  0.00% … 98.21%
 Time  (median):     23.267 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   24.078 μs ± 63.261 μs  ┊ GC (mean ± σ):  16.75% ±  6.49%

  ▁▂▃█▆▂▂▂▂▁    ▁▃▃▃▄█▇▄▂▃▃▂▁▁▁                    ▁          ▂
  ██████████▇▇▇▆█████████████████▇█▇██▇▇▇▆▆▇▆▅▆▇▅▆▇██▇▄▅▄▂▄▄▅ █
  6.17 μs      Histogram: log(frequency) by time      58.8 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

那么Julia是只针对没有矢量参与函数内部计算的特殊情形有优化吗？
如果只对这一种for循环有优化，那么它这么弄的用法在哪里？
如果其它的情形有优化，可以给个链接吗？

XuJingye · 2023 年3 月 17 日 04:55

XuJingye:

function main3!(v) # 这个是为了和上面对比时间，循环真的做了计算。14.560 ns
    v[1] = 1
    s = 0
    for i in 1:100
        s += i
    end
    v[2] = s
    nothing
end

新算的结果

julia> @benchmark main3!($tmp_Vector)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.500 ns … 21.400 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.600 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.023 ns ±  2.411 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃                                                       ▁ ▁
  ██▃▆▆▆▇▇▄▅▄▁▄▄▁▁▄▁▁▃▃▁▁▁▄▃▁▁▁▄▄▇▄▁▄▇▄▁▁▁▁▄▁▁▁▃▁▃▃▁▁▁▁▁▄▄▆█ █
  2.5 ns       Histogram: log(frequency) by time     19.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> tmp_Vector     # 结果是正确的
2-element Vector{Float64}:
    1.0
 5050.0

对于这里的计算结果2.6ns我非常好奇，算了下，我的电脑2.6GHz，2.6ns不到7个周期。我很好奇怎么算的？我以为至少100周期呢？@code_llvm看不懂。

julia> @code_llvm main3!(tmp_Vector)
;  @ c:\Users\徐景晔\Documents\JuliaProgram\优化Julia\test2.jl:21 within `main3!'
; Function Attrs: uwtable
define nonnull {}* @"japi1_main3!_1703"({}* %0, {}** %1, i32 %2) #0 {
top:
  %3 = alloca {}**, align 8
  store volatile {}** %1, {}*** %3, align 8
  %4 = load {}*, {}** %1, align 8
;  @ c:\Users\徐景晔\Documents\JuliaProgram\优化Julia\test2.jl:22 within `main3!'
; ┌ @ array.jl:843 within `setindex!'
   %5 = bitcast {}* %4 to { i8*, i64, i16, i16, i32 }*
   %6 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %5, i64 0, i32 1
   %7 = load i64, i64* %6, align 8
   %.not = icmp eq i64 %7, 0
   br i1 %.not, label %oob, label %idxend

oob:                                              ; preds = %top
   %8 = alloca i64, align 8
   store i64 1, i64* %8, align 8
   call void @jl_bounds_error_ints({}* %4, i64* nonnull %8, i64 1)
   unreachable

idxend:                                           ; preds = %top
   %9 = bitcast {}* %4 to double**
   %10 = load double*, double** %9, align 8
   store double 1.000000e+00, double* %10, align 8
; └
;  @ c:\Users\徐景晔\Documents\JuliaProgram\优化Julia\test2.jl:27 within `main3!'
; ┌ @ array.jl:843 within `setindex!'
   %.not10 = icmp eq i64 %7, 1
   br i1 %.not10, label %oob7, label %idxend8

oob7:                                             ; preds = %idxend
   %11 = alloca i64, align 8
   store i64 2, i64* %11, align 8
   call void @jl_bounds_error_ints({}* %4, i64* nonnull %11, i64 1)
   unreachable
idxend8:                                          ; preds = %idxend  
   %12 = getelementptr inbounds double, double* %10, i64 1
   store double 5.050000e+03, double* %12, align 8
; └
;  @ c:\Users\徐景晔\Documents\JuliaProgram\优化Julia\test2.jl:28 within `main3!'                                                         hin `main3!'
  ret {}* inttoptr (i64 1800907472 to {}*)
}

zzz654321 · 2023 年3 月 17 日 08:38

store double 5.050000e+03, double* %12, align 8

循环优化掉了。

XuJingye · 2023 年3 月 17 日 10:05

意思是它发现s最终是个固定值，直接把这个值记下是吧？
牛

XuJingye · 2023 年3 月 17 日 10:19

可是，我测试了下累加，保留参数n，为啥还是那么快？

function main1(n)
    s = 0
    for i in 1:n
        s += i
    end
    s
end

对应@code_llvm结果

julia> @code_llvm main1(100)
;  @ d:\JuliaProgramme\优化Julia\test2.jl:5 within `main1`
; Function Attrs: uwtable
define i64 @julia_main1_2285(i64 signext %0) #0 {
top:
;  @ d:\JuliaProgramme\优化Julia\test2.jl:7 within `main1`
; ┌ @ range.jl:5 within `Colon`
; │┌ @ range.jl:354 within `UnitRange`
; ││┌ @ range.jl:359 within `unitrange_last`
     %.inv = icmp sgt i64 %0, 0
     %1 = select i1 %.inv, i64 %0, i64 0
; └└└
  br i1 %.inv, label %L13.preheader, label %L29

L13.preheader:                                    ; preds = %top
;  @ d:\JuliaProgramme\优化Julia\test2.jl:8 within `main1`      
  %2 = shl nuw i64 %1, 1
  %3 = add nsw i64 %1, -1
  %4 = zext i64 %3 to i65
  %5 = add nsw i64 %1, -2
  %6 = zext i64 %5 to i65
  %7 = mul i65 %4, %6    
  %8 = lshr i65 %7, 1     
  %9 = trunc i65 %8 to i64
  %10 = add i64 %2, %9
  %11 = add i64 %10, -1
;  @ d:\JuliaProgramme\优化Julia\test2.jl:10 within `main1`
  br label %L29

L29:                                              ; preds = %L13.preheader, %top
  %value_phi9 = phi i64 [ 0, %top ], [ %11, %L13.preheader ]
  ret i64 %value_phi9
}

它难不成会用n(n+1)/2?

Rratic · 2023 年3 月 17 日 11:46

我猜会
这好像是编译器优化的典型例子

zzz654321 · 2023 年3 月 20 日 08:15

看代码已经是了。

woclass · 2023 年3 月 21 日 06:57

julia 后端用的 llvm 的优化，讲道理大家都大差不差的。