A question about optimizing @threads performance

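Rozarspady · August 7, 2020, 13:57

I have been using MATLAB until now, with parfor wherever parallelism is involved. With 4 workers, say, a code block typically runs about 4 times faster. Why is the speedup from @threads in Julia so modest? Take solving differential equations as an example: 1000 cases take 60 s without multithreading, but still 40 s with 4 threads, whereas in MATLAB the time would drop by roughly a factor of 4. What is causing this?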

Rozarspady · August 7, 2020, 14:03

I am using Julia 1.4.2.

Sukanka · August 7, 2020, 15:34

Please, just post the code.

Rozarspady · August 8, 2020, 01:12

using Base.Threads
using DifferentialEquations
using PyCall
numpy = pyimport("numpy")  # the @pyimport macro is deprecated in recent PyCall versions

#to calculate the odefun with initial conditions

#define laser
omg=0.057
T=2*π/omg
tf=7*T
function GetLaser(omg,I,elli,t)
    #linear
    E0=sqrt(I/(3.51e16))
    T=2*π/omg
    #agapi form
    #Ex=-E0/omg*(exp(-t^2/421.3428^2)*cos(omg*t)*omg-2*t/421.3428^2*exp(-t^2/421.3428^2)*sin(omg*t))
    #f(t) Trapezoidal envelope
    Ex=E0*(1.0*(t>=0.0 && t<5*T)+(-t/(2.0*T)+7/2)*(t>=5*T && t<=7*T))*cos(omg*t)
    Ey=0.0
    Ez=0.0
    return [Ex Ey Ez]
end

#define odefun
#p=[omg I elli soft]
function odeHe!(du,u,p,t)
    #e1 with n
    ne1=(u[1]^2+u[3]^2+u[5]^2+p[4])^(-3/2)
    #e2 with n
    ne2=(u[7]^2+u[9]^2+u[11]^2+p[4])^(-3/2)
    #e1 with e2
    e1e2=((u[1]-u[7])^2+(u[3]-u[9])^2+(u[5]-u[11])^2)^(-3/2)
    #laser
    E=GetLaser(p[1],p[2],p[3],t)
    Ex=E[1]
    Ey=E[2]
    Ez=E[3]
    #the direction of k:z direction
    Bx=-Ey/137.0
    By=Ex/137.0
    Bz=0.0

    du[1]=u[2]
    du[2]=-2.0*u[1]*ne1+(u[1]-u[7])*e1e2-Ex+(u[6]*By-u[4]*Bz)
    du[3]=u[4]
    du[4]=-2.0*u[3]*ne1+(u[3]-u[9])*e1e2-Ey+(u[2]*Bz-u[6]*Bx)
    du[5]=u[6]
    du[6]=-2.0*u[5]*ne1+(u[5]-u[11])*e1e2-Ez+(u[4]*Bx-u[2]*By)
    du[7]=u[8]
    du[8]=-2.0*u[7]*ne2+(u[7]-u[1])*e1e2-Ex+(u[12]*By-u[10]*Bz)
    du[9]=u[10]
    du[10]=-2.0*u[9]*ne2+(u[9]-u[3])*e1e2-Ey+(u[8]*Bz-u[12]*Bx)
    du[11]=u[12]
    du[12]=-2.0*u[11]*ne2+(u[11]-u[5])*e1e2-Ez+(u[10]*Bx-u[8]*By)
end

#load initial conditions
iniconditions=numpy.loadtxt("ini.txt")
init=numpy.loadtxt("init.txt")
num=length(iniconditions[:,1])



final=zeros(Float64,num,12)
numthreads=nthreads()
println("num of threads = $numthreads")
@time @threads for i in 1:100
    #define p of ode
    p=[0.057,2.0e15,0,0.1]
    u0=iniconditions[i,:]
    tspan=(init[i],tf)
    tra=ODEProblem(odeHe!,u0,tspan,p)
    sol=solve(tra,Vern7(),reltol=1e-9, abstol=1e-9)
    final[i,:]=sol.u[end]
end

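Rozarspady · August 8, 2020, 01:15

It is an ODE-solving program like the one above. In practice the speedup from @threads is small: 100 cases take about 7 s on one thread and about 5 s on 4 threads, so the improvement is marginal.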

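Sukanka · August 8, 2020, 07:44

Please describe your problem in more detail, for example by writing out the differential equations. I can neither follow your code nor run it.
Also, with @threads two threads are not necessarily twice as fast as one (although they often are). I have used @threads before: roughly 10 hours on one thread, 5 hours on two threads, and 3 hours on four threads.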
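To put the timings quoted in this thread side by side, here is a small sketch that computes speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by thread count). The helper names `speedup` and `efficiency` are made up for illustration; the numbers are the ones reported above.

```julia
# Speedup = serial time / parallel time.
speedup(t_serial, t_parallel) = t_serial / t_parallel

# Efficiency = speedup / number of threads (1.0 means perfect scaling).
efficiency(t_serial, t_parallel, n) = speedup(t_serial, t_parallel) / n

# (label, serial time, parallel time, threads), taken from the posts above.
cases = [
    ("10 h -> 5 h on 2 threads", 10.0, 5.0, 2),
    ("10 h -> 3 h on 4 threads", 10.0, 3.0, 4),
    ("7 s -> 5 s on 4 threads", 7.0, 5.0, 4),
]

for (label, t1, tn, n) in cases
    s = round(speedup(t1, tn), digits=2)
    e = round(100 * efficiency(t1, tn, n), digits=1)
    println("$label: speedup $(s)x, efficiency $(e)%")
end
```

This gives 2.0x (100%) and 3.33x (83.3%) for the 10-hour job, against 1.4x (35%) for the ODE run, so the ODE case scales far worse than the sublinear scaling described in the post above.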

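Rozarspady · August 8, 2020, 08:03

Hello. Based on your experience, I think I now roughly understand where my problem lies. I had not looked at how long each individual thread runs, and what I am seeing is exactly the result of the threads having unequal workloads. I intuitively assumed that n threads should give roughly an n-fold speedup. That was my mistake, and it comes from having always seen that with MATLAB's parfor. Separately, is there a way to make the threads more efficient? I would like the workload of each thread to be as equal as possible, because at the moment the per-thread run times under @threads differ quite a lot. Alternatively, is there a way to assign thread IDs manually? Thanks!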
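One way to even out per-thread work with Base alone is to create one task per item with `Threads.@spawn` (available since Julia 1.3), so that a free thread picks up the next pending task. In Julia 1.4, by contrast, `@threads` splits the iteration range into one fixed contiguous chunk per thread. The sketch below is only an illustration: `uneven_work` is a hypothetical stand-in for one ODE solve whose cost grows with `i`, and `run_dynamic` is a made-up helper name.

```julia
using Base.Threads

# Hypothetical stand-in for one ODE solve; the cost grows with i,
# so equal-sized contiguous chunks would be badly unbalanced.
function uneven_work(i)
    s = 0.0
    for k in 1:(i * 1_000)
        s += sin(k)
    end
    return s
end

# Spawn one task per item; the scheduler runs each task on an available thread.
function run_dynamic(n)
    tasks = [Threads.@spawn uneven_work(i) for i in 1:n]
    return fetch.(tasks)  # results come back in the order 1:n
end

results = run_dynamic(100)
println("solved ", length(results), " items on ", nthreads(), " threads")
```

Each task returns its own result instead of writing into a shared array, which keeps the loop body free of shared mutable state. Whether this helps for the ODE program above depends on where the time is actually spent, so it is worth timing both variants.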

johnnychen94 · August 8, 2020, 08:08

You could try ThreadPools.jl.

Rozarspady · August 8, 2020, 08:10

Thanks for the suggestion, I will give it a try.

xgdgsc · August 8, 2020, 08:12

Is JULIA_NUM_THREADS set to 4? How many cores does the CPU have?

Rozarspady · August 8, 2020, 08:15

I checked that with nthreads(). I am running on our group's server, with 60 threads enabled.
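The check discussed here can be done in a few lines. This is a minimal sketch using only Base: `nthreads()` reports how many threads the current Julia session was started with (set through the `JULIA_NUM_THREADS` environment variable before launch), and `Sys.CPU_THREADS` reports the number of logical CPU threads Julia detects on the machine.

```julia
using Base.Threads

# Threads available to this Julia session (fixed at startup).
println("Julia threads:       ", nthreads())

# Logical CPU threads detected on this machine.
println("Logical CPU threads: ", Sys.CPU_THREADS)

# Value of the environment variable, if it was set at all.
println("JULIA_NUM_THREADS:   ", get(ENV, "JULIA_NUM_THREADS", "(not set)"))
```

Comparing the first two numbers shows whether the session is asking for more threads than the hardware provides.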

xgdgsc · August 8, 2020, 08:21

This may be a useful reference:

github.com/JuliaLang/julia

Does the number of threads affect the executed code?

Opened 14 Dec 2019, 02:45 PM UTC by PetrKryslUCSD · closed 29 Sep 2021, 08:25 PM UTC · labels: parallel, needs more info

The work is proportional to the number of elements. For a mesh of 128000 elements both a serial and 1-thread simulation carry out the computational work in 2.5 seconds. For a mesh of 1024000 elements both a serial and 1-thread simulation carry out the computational work in around 20.0 seconds. So, eight times more work, eight times longer.

Now comes the weird part. When I use 2 threads, so that each thread works on 512000 elements, the amount of work per thread is 10 seconds. However the work procedure shows that it consumes around 16.5 seconds. When I use 4 threads, each thread works on 256,000 elements, and consequently the work procedure should execute in 5 seconds. However, the work procedure actually shows that it consumes roughly 15.6 seconds. With 8 threads, each thread works on 128,000 elements, and the work procedure should only take 2.5 seconds. However, it reports to take roughly 14 seconds.

The threaded execution therefore looks like this:

| Number of elements | Number of threads | Execution time per thread (s) |
|---|---|---|
| 1024000 | 1 | 20 |
| 512000 | 2 | 16.5 |
| 256000 | 4 | 15.6 |
| 128000 | 8 | 14 |

The weird thing is I time the interior of the work procedure. So that should exclude any overhead associated with threading. However, as you can see the number of threads actually affects how much time the work procedure spends doing the work. The total amount of time farming out the work to the threads is very small. The total amount of time collecting the data with `wait` pretty much is equal to the amount of time reported by the work procedure. As if the overhead related to threading was very small.

The whole thing can be exercised by

```
git clone https://github.com/PetrKryslUCSD/FinEtoolsDeforNonlinear.jl
```

followed by

```
cd FinEtoolsDeforNonlinear.jl
export JULIA_NUM_THREADS=8
julia
```

and

```
include("threaded_test.jl")
```

I'm sorry I don't have a more minimal working example!

Rozarspady · August 8, 2020, 08:24

OK, I will take a look.

Rozarspady · August 8, 2020, 10:05

Where can I find the documentation for this package? I could not find it on the official website.

hyper0x · August 12, 2020, 05:06