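I've heard that the M1 chip speeds up Julia compilation a lot. If that's true, it would go a long way toward solving Julia's main pain point.
I'm currently on a 12700K. Multi-core run speed is fine, but compilation is too slow, and I have to wait a long time after every restart.
So my question is: how much faster is Julia on an M1 chip, in particular on the Mac Studio with the M1 Max?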
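- Mac ARM is still only Tier 3 supported, which means many packages don't work.
- The Mac x64 build under Rosetta translation should still be much faster than the 12700K. When I tested on a Mac M1,
using Images
was still faster than on a MacBook Pro with an i9.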
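The load-time comparison above can be reproduced with a small timing sketch. This is only an illustration: `Statistics` is a stand-in for `Images` (swap it in if it is installed), and `work` is a hypothetical function used to show first-call compile latency. Run it in a fresh session, since a package that is already loaded will report a time near zero.

```julia
# Time how long a package takes to load.
# `Statistics` is a stand-in; replace it with `Images` to repeat the test above.
load_s = @elapsed @eval using Statistics

# Hypothetical workload: the first call includes JIT compilation,
# the second call reuses the compiled code.
work(x) = sum(abs2, x) / length(x)
x = rand(10^6)
first_call_s  = @elapsed work(x)
second_call_s = @elapsed work(x)

println("package load: ", round(load_s; digits = 4), " s")
println("first call:   ", round(first_call_s; digits = 4), " s")
println("second call:  ", round(second_call_s; digits = 4), " s")
```

The gap between the first and second call is the compile latency the original question is about, so it is a more direct measure than run speed.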
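What about single-core or multi-core run speed? Roughly how big is the difference?
I don't have access to an Apple silicon machine right now, so I can't give you concrete numbers. But in my earlier tests, even translated mode was faster than Intel.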
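I just compared published results for the older-generation M1 against my own 12700K. At every explicitly set thread count, the older M1 is slower than the Intel chip.
Here is my test: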
Julia Version 1.8.0-beta1
Commit 7b711ce699 (2022-02-23 15:09 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700K
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
Threads: 20 on 20 virtual cores
Environment:
JULIA_NUM_THREADS = 20
using LinearAlgebra, BenchmarkTools
A=rand(1000,1000); B = rand(1000,1000);
@benchmark $A * $B
BenchmarkTools.Trial: 559 samples with 1 evaluation.
Range (min … max): 6.892 ms … 14.580 ms ┊ GC (min … max): 0.00% … 37.91%
Time (median): 8.184 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.934 ms ± 1.786 ms ┊ GC (mean ± σ): 8.81% ± 13.76%
▅▄█▆▄
▂▁▂▁▁▃▄▆██████▆▃▄▃▃▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▃▄▄▄▃▃▄▃▃▃▂ ▃
6.89 ms Histogram: frequency by time 13.5 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
------------------------------
BLAS.set_num_threads(1)
@benchmark $A * $B
BenchmarkTools.Trial: 169 samples with 1 evaluation.
Range (min … max): 28.425 ms … 36.640 ms ┊ GC (min … max): 0.00% … 8.47%
Time (median): 29.222 ms ┊ GC (median): 0.00%
Time (mean ± σ): 29.734 ms ± 1.466 ms ┊ GC (mean ± σ): 1.72% ± 3.57%
▁█▅ ▁ ▂
███▇▅█▃█▆▅█▅▅▆▆▃▃▃▁▃▃▃▃▃▃▁▁▁▃▃▃▃▅▃▃▃▁▃▃▄▁▃▃▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▃ ▃
28.4 ms Histogram: frequency by time 34.5 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
-----------------------------------------------
BLAS.set_num_threads(2)
@benchmark $A * $B
BenchmarkTools.Trial: 305 samples with 1 evaluation.
Range (min … max): 15.164 ms … 21.287 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 15.755 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.393 ms ± 1.321 ms ┊ GC (mean ± σ): 2.95% ± 5.83%
▇█▇
▄▄▆████▅▆▆▆▅▅▄▄▂▃▂▂▂▂▁▂▁▁▁▁▂▁▂▁▂▂▃▃▄▃▅▃▃▃▃▄▂▃▁▂▂▂▁▁▁▁▁▂▁▂▁▂ ▃
15.2 ms Histogram: frequency by time 20.5 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2
------------------------------------------------
BLAS.set_num_threads(4)
@benchmark $A * $B
BenchmarkTools.Trial: 521 samples with 1 evaluation.
Range (min … max): 8.264 ms … 17.028 ms ┊ GC (min … max): 0.00% … 25.05%
Time (median): 8.891 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.594 ms ± 1.571 ms ┊ GC (mean ± σ): 6.35% ± 10.89%
▅█▅▁
▃▄▇████▇▅▄▃▃▂▁▁▁▂▁▁▂▂▁▂▁▂▂▁▂▁▁▁▂▁▂▃▃▄▄▄▄▃▂▃▂▂▁▁▁▂▁▁▁▁▁▁▂▁▂ ▃
8.26 ms Histogram: frequency by time 14.8 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
------------------------------------------------
BLAS.set_num_threads(8)
@benchmark $A * $B
BenchmarkTools.Trial: 670 samples with 1 evaluation.
Range (min … max): 5.733 ms … 17.214 ms ┊ GC (min … max): 0.00% … 26.59%
Time (median): 6.337 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.452 ms ± 2.079 ms ┊ GC (mean ± σ): 9.43% ± 14.67%
▂██▆▆▅▂▂▁ ▁ ▁ ▂▃▂▂
█████████▇█▇█▇▇▇▆▅▆▇▇▄▆▇█▆█▆███████▇▇▅▆▇▇▅▅▅▄▄▆▁▄▄▅▁▁▄▄▄▁▄ █
5.73 ms Histogram: log(frequency) by time 14 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
------------------------------------------------
BLAS.set_num_threads(10)
@benchmark $A * $B
BenchmarkTools.Trial: 444 samples with 1 evaluation.
Range (min … max): 7.809 ms … 16.389 ms ┊ GC (min … max): 0.00% … 30.29%
Time (median): 10.819 ms ┊ GC (median): 0.00%
Time (mean ± σ): 11.265 ms ± 1.846 ms ┊ GC (mean ± σ): 6.78% ± 11.42%
▄█▆█▃
▃▃▃▃▃▃▃▃▁▃▄▃▄▃▃▃▄▄▇███████▄▃▁▁▁▁▂▁▁▁▁▂▂▁▁▂▁▃▁▃▁▃▁▂▂▃▂▄▄▄▅▅▅ ▃
7.81 ms Histogram: frequency by time 15.7 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
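The manual steps above (set the BLAS thread count, rerun the multiplication) can be collected into one loop. This is a rough sketch using only the standard library: it uses `@elapsed` and takes the minimum over a few runs instead of `@benchmark`, so the numbers will be noisier than BenchmarkTools output. The helper name `min_matmul_ms` is made up for this example.

```julia
using LinearAlgebra

# Smallest wall-clock time over `reps` runs of A * B, in milliseconds.
function min_matmul_ms(A, B; reps = 5)
    A * B                      # warm-up run so compilation is not timed
    return 1000 * minimum(@elapsed(A * B) for _ in 1:reps)
end

A = rand(1000, 1000); B = rand(1000, 1000)
results = Dict{Int,Float64}()
for n in (1, 2, 4, 8, 10)      # same thread counts as in the runs above
    BLAS.set_num_threads(n)
    results[n] = min_matmul_ms(A, B)
    println("BLAS threads = ", n, ": ", round(results[n]; digits = 3), " ms")
end
```

Running the same script on both machines keeps the thread counts and matrix sizes identical, which makes the comparison a little less error-prone than copying REPL sessions by hand.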
Below is an existing test from the web:
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.1.0)
CPU: Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)
julia> using LinearAlgebra, BenchmarkTools
julia> A = rand(1000,1000); B = rand(1000,1000);
julia> @benchmark $A * $B
BenchmarkTools.Trial: 665 samples with 1 evaluation.
Range (min … max): 7.155 ms … 19.538 ms ┊ GC (min … max): 0.00% … 6.96%
Time (median): 7.252 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.518 ms ± 734.446 μs ┊ GC (mean ± σ): 2.42% ± 4.87%
▃█▇▃▂ ▂▂
█████▆▅▅▅▆▅▁▄▁▁▄▁▄▄▄▁▁████▇▇▆▆▆▁▁▅▅▅▆▄▅▁▁▄▁▅▁▁▁▁▁▁▁▁▄▅▁▁▄▁▅ ▇
7.16 ms Histogram: log(frequency) by time 9.84 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
julia> BLAS.set_num_threads(1)
julia> @benchmark $A * $B
BenchmarkTools.Trial: 119 samples with 1 evaluation.
Range (min … max): 42.017 ms … 43.615 ms ┊ GC (min … max): 0.00% … 3.18%
Time (median): 42.179 ms ┊ GC (median): 0.00%
Time (mean ± σ): 42.364 ms ± 440.317 μs ┊ GC (mean ± σ): 0.44% ± 1.00%
▅▅▄█▄▁
▃▆███████▆▅▃▃▃▄▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▅▄▁▃▃▄▃▁▁▄▁▅▁▁▁▃ ▃
42 ms Histogram: frequency by time 43.6 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
julia> BLAS.set_num_threads(2)
julia> @benchmark $A * $B
BenchmarkTools.Trial: 220 samples with 1 evaluation.
Range (min … max): 22.335 ms … 24.428 ms ┊ GC (min … max): 0.00% … 7.45%
Time (median): 22.536 ms ┊ GC (median): 0.00%
Time (mean ± σ): 22.726 ms ± 463.375 μs ┊ GC (mean ± σ): 0.85% ± 1.87%
▃▆█▇▆▃
█▁▆██████▇█▆▅▅▁▅▁▅▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▅▆▁██▇█▁█▅▅▆▅▁▅▅▅▁▅▁▁▁▅ ▆
22.3 ms Histogram: log(frequency) by time 24.2 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
julia> BLAS.set_num_threads(4)
julia> @benchmark $A * $B
BenchmarkTools.Trial: 392 samples with 1 evaluation.
Range (min … max): 12.271 ms … 14.460 ms ┊ GC (min … max): 0.00% … 13.15%
Time (median): 12.555 ms ┊ GC (median): 0.00%
Time (mean ± σ): 12.760 ms ± 501.337 μs ┊ GC (mean ± σ): 1.67% ± 3.49%
▅▇██▅
▄▁▁▇████████▄▅▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▅█▆▅▆▆▇██▁█▇▆▇▆▄▁▆▅▅ ▆
12.3 ms Histogram: log(frequency) by time 14.2 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
julia> BLAS.set_num_threads(8)
julia> @benchmark $A * $B
BenchmarkTools.Trial: 666 samples with 1 evaluation.
Range (min … max): 7.178 ms … 23.716 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.270 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.509 ms ± 800.333 μs ┊ GC (mean ± σ): 2.63% ± 5.27%
▂██▆▃ ▁ ▂
██████▄▆▆▄▄▁▁▁▁▁▁▁▁▅▁▁▁▄▄▄▁▁▄▁▆█████▆▇█▇▆▆▆▅▁▆▄▄▄▄▄▁▁▁▄▁▁▁▄ ▇
7.18 ms Histogram: log(frequency) by time 9.2 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
julia> BLAS.set_num_threads(10)
julia> @benchmark $A * $B
BenchmarkTools.Trial: 274 samples with 1 evaluation.
Range (min … max): 15.185 ms … 47.571 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.907 ms ┊ GC (median): 0.00%
Time (mean ± σ): 18.285 ms ± 2.479 ms ┊ GC (mean ± σ): 1.26% ± 2.71%
▁▆▂▁▁▅▁█ ▂█▁▆▃ ▃ ▂ ▁
▄▄▁▆▁▅▇█▆█████████▇█████▆▆█▆█▆█▅██▅▅▅▄▁▅▇▁▄▇▄▅▆▃▁▃▁▁▄▃▃▁▃▁▃ ▄
15.2 ms Histogram: frequency by time 22.9 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
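I wonder whether the M1's multi-core disadvantage could be reduced by any of the following:
- The Julia version above is 1.7.0; some posts online say 1.7.1 fixed the M1 multi-core issue;
- The M1 above is first generation; I don't know whether the latest Mac Studio performs better than it;
- The M1 test above was run in 2021.09, when Julia on M1 may have had some defects; I don't know whether the latest 1.8.0 has already fixed them.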
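In my experience, more packages work than you'd expect. Right now the ones I can't use are mainly a few packages with binary dependencies (those ending in _jll).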
What I find strange is that the results I got on an M1 Pro (8 performance cores + 2 efficiency cores) are actually slower than the M1's, and I don't know why:
julia> using LinearAlgebra, BenchmarkTools
julia> A=rand(1000,1000); B = rand(1000,1000);
julia> @benchmark $A * $B
BenchmarkTools.Trial: 290 samples with 1 evaluation.
Range (min … max): 15.307 ms … 23.579 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.995 ms ┊ GC (median): 0.00%
Time (mean ± σ): 17.269 ms ± 1.273 ms ┊ GC (mean ± σ): 0.21% ± 0.52%
▂▃ █ ▄▃
▃▄▅▃▄▅█▆███▇█▅██▇██▇▅▇▆▅▆▄▃▄▅▃▃▃▃▄▃▄▄▁▃▃▃▁▁▃▁▃▁▁▁▁▁▁▁▁▁▃▁▃▃ ▃
15.3 ms Histogram: frequency by time 22 ms <
Memory estimate: 7.63 MiB, allocs estimate: 2.
julia> BLAS.set_num_threads(1)
julia> @benchmark $A * $B
BenchmarkTools.Trial: 50 samples with 1 evaluation.
Range (min … max): 99.670 ms … 102.885 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 100.109 ms ┊ GC (median): 0.00%
Time (mean ± σ): 100.365 ms ± 685.301 μs ┊ GC (mean ± σ): 0.03% ± 0.07%
julia> versioninfo()
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: 10 × Apple M1 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, westmere)
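The `versioninfo()` output above reports `x86_64-apple-darwin18.7.0` and an LLVM target of `westmere`, which suggests this session was an x86_64 build of Julia rather than a native arm64 one. That may account for part of the gap, though it is only a guess from the output. A quick way to check which build is running, as a sketch:

```julia
# Report which architecture this Julia binary was built for.
arch = Sys.ARCH                      # e.g. :x86_64 or :aarch64
println("Julia build architecture: ", arch)
println("CPU name seen by LLVM:    ", Sys.CPU_NAME)

# On Apple silicon, a native build reports :aarch64;
# :x86_64 there means the process is running under translation.
is_arm_build = arch === :aarch64
println("arm64 build: ", is_arm_build)
```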
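This test multiplies two matrices, which calls into BLAS, so it may not be a fair test of Julia code. Leaving aside how BLAS threads and Julia threads interact, once the two matrices reach 1000*1000 we are almost entirely testing the size and performance of memory and cache. Benchmark results should be read with caution, especially across platforms. And now that big/little core designs are being pushed, I'm also not clear on how the system schedules resources…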
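To put rough numbers on the memory point, here is a small back-of-the-envelope sketch. It assumes the usual count of about 2n^3 floating-point operations for a dense n×n multiplication; the timings plugged in at the end are the minimum times reported earlier in this thread, and the helper name `gflops` is made up for this example.

```julia
n = 1000

# Size of one n×n Float64 matrix; matches the "7.63 MiB" memory estimate above.
bytes_per_matrix = n * n * sizeof(Float64)
mib_per_matrix   = bytes_per_matrix / 2^20
working_set_mib  = 3 * mib_per_matrix        # A, B and the result matrix

# A dense n×n matrix product takes about 2n^3 floating-point operations.
flops = 2 * n^3
gflops(ms) = flops / (ms * 1e-3) / 1e9       # throughput for a run of `ms` milliseconds

println("one matrix:  ", round(mib_per_matrix; digits = 2), " MiB")
println("working set: ", round(working_set_mib; digits = 2), " MiB")
println("12700K, default threads (6.892 ms): ", round(gflops(6.892); digits = 1), " GFLOPS")
println("M1 Max, default threads (7.155 ms): ", round(gflops(7.155); digits = 1), " GFLOPS")
```

Comparing the working-set size against each chip's cache size is one way to judge how much of the result reflects the memory system rather than the cores.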