Efficiency problem with multi-core parallel computing


#1

A simple example: divide every element of array a by 3. The code is as follows:

function foo_2(n)
    a = collect(1:n)    # with n = 1.0e8 this is a Vector{Float64};
                        # an integer n would make a[i] = a[i]/3 throw an InexactError
    for i = 1:size(a, 1)
        a[i] = a[i] / 3
    end
end

julia> @benchmark foo_2(100000000.0)
BenchmarkTools.Trial: 
  memory estimate:  762.94 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.264 s (0.90% GC)
  median time:      1.399 s (8.76% GC)
  mean time:        1.376 s (7.50% GC)
  maximum time:     1.444 s (10.73% GC)
  --------------
  samples:          4
  evals/sample:     1
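For reference, the same loop can be written as a single in-place broadcast, which is a useful single-core baseline before reaching for parallelism. This is a sketch (`foo_broadcast` is a hypothetical name, not from the thread); it assumes we only care about the division itself, not the allocation of `a`:

```julia
# In-place broadcast baseline: divides every element of a Float64 vector by 3.
function foo_broadcast(n)
    a = collect(1.0:n)   # Float64 array, matching the benchmark's float input
    a ./= 3              # fused, allocation-free elementwise division
    return a
end
```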

After I rewrote it as a parallel version:

using Distributed
addprocs(2)
@everywhere using SharedArrays   # the workers also need SharedArrays loaded

function foo_1(n)
    a = collect(1:n)     # n is a Float64 here, so a is a Vector{Float64}
    a = SharedArray(a)   # note: this allocation and copy is inside the timed function
    @sync @distributed for i = 1:size(a, 1)
        a[i] = a[i] / 3
    end
end

julia> @benchmark foo_1(100000000.0)
BenchmarkTools.Trial: 
  memory estimate:  762.97 MiB
  allocs estimate:  809
  --------------
  minimum time:     10.795 s (0.10% GC)
  median time:      10.795 s (0.10% GC)
  mean time:        10.795 s (0.10% GC)
  maximum time:     10.795 s (0.10% GC)
  --------------
  samples:          1
  evals/sample:     1

I don't quite understand why it got so much slower after parallelizing :joy: — where is the problem?


#2

This comparison isn't quite fair: the multi-process version includes the allocation of the shared array (and the copy into it) in the timing. Also, @distributed is a poor fit here — it ships the loop body off to the worker processes, and when each iteration does only a trivial computation like this, the scheduling and inter-process communication overhead dominates, so it ends up very slow.

You might want to study the three examples in the documentation:

https://docs.juliacn.com/latest/manual/parallel-computing/#man-shared-arrays-1
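The pattern those docs demonstrate can be sketched roughly as follows (`foo_chunked` is a hypothetical name; this assumes workers have already been added with `addprocs`): each worker loops over its own contiguous slice of the shared array via `localindices`, so there is one remote call per worker instead of fine-grained scheduling of tiny tasks.

```julia
using Distributed
addprocs(2)
@everywhere using SharedArrays

function foo_chunked(n)
    a = SharedArray{Float64}(n)
    a .= 1:n                           # fill the shared array with 1.0 .. n
    @sync for p in procs(a)            # the worker pids that map this array
        @async remotecall_wait(p, a) do s
            for i in localindices(s)   # this worker's own contiguous slice
                s[i] /= 3
            end
        end
    end
    return a
end
```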


#3

For a problem like this, multithreading is enough, and its overhead is much smaller than multiprocessing :thinking:
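A minimal multithreaded sketch of the same loop (`foo_threaded` is a hypothetical name; this assumes Julia was started with multiple threads, e.g. `julia -t 4`). Threads share memory, so a plain Vector works and there is no shared-array setup or inter-process serialization cost:

```julia
# Multithreaded elementwise division: @threads splits the index range into
# one contiguous chunk per thread, all operating on the same array.
function foo_threaded(n)
    a = collect(1.0:n)   # Float64 array, same as the serial version
    Threads.@threads for i in eachindex(a)
        a[i] /= 3
    end
    return a
end
```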


#4

Thanks, I get it now :grin:


#5

Right, this really is better suited to multithreading. I had read the pmap documentation before, and I probably misunderstood it:

Julia’s pmap is designed for the case where each function call does a large amount of work. In contrast, @distributed for can handle situations where each iteration is tiny, perhaps merely summing two numbers. Only worker processes are used by both pmap and @distributed for for the parallel computation. In case of @distributed for, the final reduction is done on the calling process.

So I had always assumed @distributed was the right fit for simple computations like this one :joy:
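To illustrate the distinction that quote is drawing, here is a hedged sketch (the function name `heavy` and the worker count are illustrative, not from the thread): pmap issues one remote call per element, so it only pays off when each call does substantial work, while @distributed splits the whole range into one chunk per worker, so tiny per-iteration work is fine as long as the reduction, done on the caller, is what you want back:

```julia
using Distributed
nprocs() == 1 && addprocs(2)   # add workers only if none exist yet

# pmap: one remote call per element; the function must also exist on workers.
@everywhere heavy(x) = sum(sqrt(i) for i in 1:100_000) + x
results = pmap(heavy, 1:4)     # worthwhile because each call is expensive

# @distributed: the range is split into one chunk per worker, and the (+)
# reduction of the per-chunk results runs on the calling process.
total = @distributed (+) for i = 1:1_000_000
    i % 3 == 0 ? 1 : 0         # count multiples of 3
end
```

Both still run on worker processes, though, so neither escapes inter-process overhead the way threads do.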