Julia内存分配随着循环的进行越来越大

tyfdzt · 2019 年12 月 22 日 08:50

刚从Matlab转过来不久，发现Julia做数值计算真的很快，但是发现了点问题，这是一个格子玻尔兹曼模拟的大部分代码，当mainFlow(iterStepMax)里面的最大迭代次数iterStepMax增加的时候，内存分配次数越来越多，我不知道这是否意味着性能下降，或者会不会出现计算机内存不够的情况，因为有时候迭代次数达到几万后数据就会显示为NaN。查了很多资料，包括性能建议里面的输出预分配https://docs.juliacn.com/latest/manual/performance-tips/，感觉不得其法，做了一些很小的优化，速度提升了，但是性能分析的内存分配仍然会增大，任务管理器中的内存占用却一直是稳定的，现在针对这个代码有点无从下手的感觉。恳请好心人帮忙分析下，内存分配的问题需不需要解决，如何应对。第一次发帖，有什么不清楚的请原谅，代码如下

function mainFlow(iterStepMax)
    Q = 9 #D2Q9
    nx = 1200 #lattice number in horizontal direction
    ny = 600 #lattice number in vertical direction
    U = 0.1 #inlet velocity
    cx = [1.0 0.0 -1.0 0.0 1.0 -1.0 -1.0 1.0 0.0] #velocity components in horizontal direction
    cy = [0.0 1.0 0.0 -1.0 1.0 1.0 -1.0 -1.0 0.0] #velocity components in vertical direction
    w = [1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 4.0 / 9.0] #weight parameters
    ρ = ones(Float64, nx, ny) #initial density
    u = zeros(Float64, nx, ny) #horizontal velocity component
    v = zeros(Float64, nx, ny) #vertical velocity component
    u0 = zeros(Float64, nx, ny) #horizontal velocity component before iteration
    v0 = zeros(Float64, nx, ny) #vertical velocity component befre iteration
    uv = zeros(Float64, nx, ny) #resultant velocity
    f = zeros(Float64, nx, ny, Q) #distribution function
    feq = zeros(Float64, nx, ny, Q) #equilibrium distribution function
    dx = 1.0 #horizontal lattice length
    dy = 1.0 #vertical lattice length
    lx = dx * nx #horizontal domain length
    ly = dy * ny #vertical domain length
    dt = dx #dt = 1.0
    c = dx / dt #lattice sound speed
    Re = 1000.0 #Reynoldz number
    ν = U * ly / Re #kinematic viscousity
    τ = 3.0ν + 0.5 #relaxation time
    ω = 1.0 / τ #relaxation frequency
    println("Re = ", Re, ", τ = ", τ) #instead of "Re = $Re, τ = $τ"
    ε = 1.0e-6 #convergence
    @views u[1, 2:ny-1] .= U
    for iterStep = 1:iterStepMax
        collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        streaming(Q, cx, cy, f)
        boundaryCondition(nx, ny, u, v, ρ, U, f)
        macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        temp1 = sum((u[:, :] - u0[:, :]) .^ 2 + (v[:, :] - v0[:, :]) .^ 2)
        temp2 = sum(u .^ 2 + v .^ 2)
        err = sqrt(temp1) / sqrt(temp2 + 1e-30)

        if err <= ε
            break
        end
    end
end

function collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
    for j = 1:ny, i = 1:nx
        t1 = u[i, j] * u[i, j] + v[i, j] * v[i, j]
        for k = 1:Q
            t2 = u[i, j] * cx[k] + v[i, j] * cy[k]
            feq[i, j, k] = ρ[i, j] * w[k] * (1.0 + 3.0t2 + 4.5 * t2^2 - 1.5t1)
            f[i, j, k] = (1.0 - ω) * f[i, j, k] + ω * feq[i, j, k]
        end
    end
end

function streaming(Q, cx, cy, f)
    for k = 1:Q
        @views f[:, :, k] .= circshift(f[:, :, k], [cx[k], cy[k]])
    end
end

function boundaryCondition(nx, ny, u, v, ρ, U, f)
    #left
    @views f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] .* U * 2.0 / 3.0
    @views f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
    @views f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
    @views u[1, 2:ny-1] .= U
    @views v[1, 2:ny-1] .= 0.0
    #right
    @views f[nx, :, 3] = f[nx-1, :, 3]
    @views f[nx, :, 6] = f[nx-1, :, 6]
    @views f[nx, :, 7] = f[nx-1, :, 7]
    #bottom
    @views f[:, 1, 2] = f[:, 1, 4]
    @views f[:, 1, 5] = f[:, 1, 7]
    @views f[:, 1, 6] = f[:, 1, 8]
    @views u[:, 1] .= 0.0
    @views v[:, 1] .= 0.0
    #top
    @views f[:, ny, 4] = f[:, ny, 2]
    @views f[:, ny, 8] = f[:, ny, 6]
    @views f[:, ny, 7] = f[:, ny, 5]
    @views u[:, ny] .= 0.0
    @views v[:, ny] .= 0.0
end

function macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
    ρ[:, :] = sum(f, dims = 3)
    u0[:, :] = u[:, :]
    v0[:, :] = v[:, :]
    for j = 1:ny, i = 1:nx
        uSum = 0.0
        vSum = 0.0
        for k = 1:Q
            uSum += f[i, j, k] * cx[k]
            vSum += f[i, j, k] * cy[k]
        end
        u[i, j] = uSum / ρ[i, j]
        v[i, j] = vSum / ρ[i, j]
    end
    uv[:, :] = @. sqrt(u[:, :]^2 + v[:, :]^2)
end

性能工具测试效果如下

@benchmark mainFlow(50)
BenchmarkTools.Trial: 
  memory estimate:  7.38 GiB
  allocs estimate:  10801
  --------------
  minimum time:     6.675 s (10.88% GC)
  median time:      6.675 s (10.88% GC)
  mean time:        6.675 s (10.88% GC)
  maximum time:     6.675 s (10.88% GC)
  --------------
  samples:          1
  evals/sample:     1

增大迭代次数

@benchmark mainFlow(100)
BenchmarkTools.Trial: 
  memory estimate:  14.63 GiB
  allocs estimate:  21551
  --------------
  minimum time:     12.903 s (10.27% GC)
  median time:      12.903 s (10.27% GC)
  mean time:        12.903 s (10.27% GC)
  maximum time:     12.903 s (10.27% GC)
  --------------
  samples:          1
  evals/sample:     1

版本信息
Win10，64位，Julia 1.3.0，编辑器用的ATOM。
如果有其他能够改善计算性能的建议，提前感谢！

woclass · 2019 年12 月 22 日 14:23

调整一下代码格式吧。

用代码块

```julia
# 粘贴你的代码
```

未优化代码: 9.296 s (10801 allocations: 7.38 GiB)

避免复制对象

u[:, :] 这样写会复制对象。

# `mainFlow` 7.38 GiB -> 6.31 GiB
# temp1 = sum((u[:, :] - u0[:, :]) .^ 2 + (v[:, :] - v0[:, :]) .^ 2)
temp1 = sum((u - u0) .^ 2 + (v - v0) .^ 2)

# `macroQuantity` 6.31 GiB -> 5.77 GiB
# uv[:, :] = @. sqrt(u[:, :]^2 + v[:, :]^2)
uv = @. sqrt(u^2 + v^2)

8.537 s (10263 allocations: 5.77 GiB)

该 `copy` 时，就 `copy`。

# `mainFlow`
# 尝试使用未初始化的数组，速度变慢
# u0 = Array{Float64}(undef, nx, ny)
# v0 = Array{Float64}(undef, nx, ny)

# `macroQuantity` 8.537 s -> 7.874 s
# ρ[:, :] = sum(f, dims = 3)
# u0[:, :] = u[:, :]
# v0[:, :] = v[:, :]
ρ = sum(f, dims = 3)
u0 = copy(u)
v0 = copy(v)

7.874 s (10201 allocations: 5.77 GiB)

类型标注

使用 @code_warntype 逐一检查函数

@code_warntype collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
少数中间变量类型为：Union{Nothing, Tuple{Int64,Int64}}
@code_warntype streaming(Q, cx, cy, f)
出现 ::Any
@code_warntype boundaryCondition(nx, ny, u, v, ρ, U, f)
无问题
@code_warntype macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
少数中间变量类型为：Union{Nothing, Tuple{Int64,Int64}}
有变量类型不稳定：ρ@_26::Union{Array{Float64,3}, Array{Float64,2}}
@code_warntype mainFlow(1)
有中间变量类型为：Union{Nothing, Tuple{Int64,Int64}}；

所有函数加上类型标注，但并没有帮助。
“7.798 s (10202 allocations: 5.77 GiB)”

ProfileView.jl 只是可视化了 @profile 的结果，帮助一般，看文本就行。

改用自带的性能分析 · Julia中文文档

.mem 文件

        - 
        - function mainFlow(iterStepMax::Int64)
348072636 Q = 9 #D2Q9
        - nx = 1200 #lattice number in horizontal direction
        - ny = 600 #lattice number in vertical direction
        - U = 0.1 #inlet velocity
      160 cx = [1.0 0.0 -1.0 0.0 1.0 -1.0 -1.0 1.0 0.0] #velocity components in horizontal direction
      160 cy = [0.0 1.0 0.0 -1.0 1.0 1.0 -1.0 -1.0 0.0] #velocity components in vertical direction
      160 w = [1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 4.0 / 9.0] #weight parameters
        - # cx = [1.0, 0.0, -1.0, 0.0, 1.0, -1.0, -1.0, 1.0, 0.0] #velocity components in horizontal direction
        - # cy = [0.0, 1.0, 0.0, -1.0, 1.0, 1.0, -1.0, -1.0, 0.0] #velocity components in vertical direction
        - # w = [1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 4.0 / 9.0] #weight parameters
        - 
  5760080 ρ = ones(Float64, nx, ny) #initial density
  5760080 u = zeros(Float64, nx, ny) #horizontal velocity component
  5760080 v = zeros(Float64, nx, ny) #vertical velocity component
  5760080 u0 = zeros(Float64, nx, ny) #horizontal velocity component before iteration
  5760080 v0 = zeros(Float64, nx, ny) #vertical velocity component befre iteration
        - # u0 = Array{Float64}(undef, nx, ny)
        - # v0 = Array{Float64}(undef, nx, ny)
  5760080 uv = zeros(Float64, nx, ny) #resultant velocity
 51840080 f = zeros(Float64, nx, ny, Q) #distribution function
 51840080 feq = zeros(Float64, nx, ny, Q) #equilibrium distribution function
        - 
        - dx = 1.0 #horizontal lattice length
        - dy = 1.0 #vertical lattice length
        - lx = dx * nx #horizontal domain length
        - ly = dy * ny #vertical domain length
        - dt = dx #dt = 1.0
        - c = dx / dt #lattice sound speed
        - Re = 1000.0 #Reynoldz number
        - ν = U * ly / Re #kinematic viscousity
        - τ = 3.0ν + 0.5 #relaxation time
        - ω = 1.0 / τ #relaxation frequency
  4792960 println("Re = ", Re, ", τ = ", τ) #instead of "Re = $Re, τ = $τ"
        - # iterStepMax = 10000
        - ε = 1.0e-6 #convergence requirement
        - 
        0 @views u[1, 2:ny-1] .= U
        0 for iterStep = 1:iterStepMax
        0     collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        0     streaming(Q, cx, cy, f)
        0     boundaryCondition(nx, ny, u, v, ρ, U, f)
        0     macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        -     # temp1 = sum((u[:, :] - u0[:, :]) .^ 2 + (v[:, :] - v0[:, :]) .^ 2)
1440020000     temp1 = sum((u - u0) .^ 2 + (v - v0) .^ 2)
864012000     temp2 = sum(u .^ 2 + v .^ 2)
        0     err = sqrt(temp1) / sqrt(temp2 + 1e-30)
        0     if err <= ε
        0         break
        -     end
        - end
        - 
        - end
        - 
        - # function collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        - # function collision(nx, ny, Q, u, v, ρ::Array{Float64,2}, cx, cy, w, ω, f, feq)
        - function collision(nx::Int64, ny::Int64, Q::Int64, 
        -     u::Array{Float64,2}, v::Array{Float64,2}, ρ::Array{Float64,2}, 
        -     cx::Array{Float64,2}, cy::Array{Float64,2}, w::Array{Float64,2}, 
        -     ω::Float64, 
        -     f::Array{Float64,3}, feq::Array{Float64,3}
        - )
        0 for j = 1:ny, i = 1:nx
        0     t1 = u[i, j] * u[i, j] + v[i, j] * v[i, j]
        0     for k = 1:Q
        0         t2 = u[i, j] * cx[k] + v[i, j] * cy[k]
        0         feq[i, j, k] = ρ[i, j] * w[k] * (1.0 + 3.0t2 + 4.5 * t2^2 - 1.5t1)
        0         f[i, j, k] = (1.0 - ω) * f[i, j, k] + ω * feq[i, j, k]
        -     end
        - end
        - end
        - 
        - # function streaming(Q, cx, cy, f)
        - function streaming(Q::Int64, 
        -     cx::Array{Float64,2}, cy::Array{Float64,2}, 
        -     f::Array{Float64,3}
        - )
        0 for k = 1:Q
2617559063     @views f[:, :, k] .= circshift(f[:, :, k], [cx[k], cy[k]])
        - end
        - end
        - 
        - # function boundaryCondition(nx, ny, u, v, ρ, U, f)
        - function boundaryCondition(nx::Int64, ny::Int64, 
        -     u::Array{Float64,2}, v::Array{Float64,2}, ρ::Array{Float64,2}, U::Float64, f::Array{Float64,3}
        - )
        - #left
   992000 @views f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] .* U * 2.0 / 3.0
  1491200 @views f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
  1491200 @views f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
        0 @views u[1, 2:ny-1] .= U
        0 @views v[1, 2:ny-1] .= 0.0
        - 
        - #right
   252800 @views f[nx, :, 3] = f[nx-1, :, 3]
   252800 @views f[nx, :, 6] = f[nx-1, :, 6]
   252800 @views f[nx, :, 7] = f[nx-1, :, 7]
        - 
        - #bottom
   492800 @views f[:, 1, 2] = f[:, 1, 4]
   492800 @views f[:, 1, 5] = f[:, 1, 7]
   492800 @views f[:, 1, 6] = f[:, 1, 8]
        0 @views u[:, 1] .= 0.0
        0 @views v[:, 1] .= 0.0
        - 
        - #top
   492800 @views f[:, ny, 4] = f[:, ny, 2]
   492800 @views f[:, ny, 8] = f[:, ny, 6]
   492800 @views f[:, ny, 7] = f[:, ny, 5]
        0 @views u[:, ny] .= 0.0
     2400 @views v[:, ny] .= 0.0
        - end
        - 
        - # function macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        - function macroQuantity(nx::Int64, ny::Int64, 
        -     Q::Int64, u::Array{Float64,2}, v::Array{Float64,2}, 
        -     uv::Array{Float64,2}, 
        -     u0::Array{Float64,2}, v0::Array{Float64,2}, 
        -     ρ::Array{Float64,2}, 
        -     cx::Array{Float64,2}, cy::Array{Float64,2}, 
        -     f::Array{Float64,3}
        - )
        - # ρ[:, :] = sum(f, dims = 3)
        - # u0[:, :] = u[:, :]
        - # v0[:, :] = v[:, :]
288016000 ρ = sum(f, dims = 3)
288004000 u0 = copy(u)
288004000 v0 = copy(v)
        - 
        0 for j = 1:ny, i = 1:nx
        -     uSum = 0.0
        -     vSum = 0.0
        0     for k = 1:Q
        0         uSum += f[i, j, k] * cx[k]
        0         vSum += f[i, j, k] * cy[k]
        -     end
        0     u[i, j] = uSum / ρ[i, j]
        0     v[i, j] = vSum / ρ[i, j]
        - end
        - # uv[:, :] = @. sqrt(u[:, :]^2 + v[:, :]^2)
288004000 uv = @. sqrt(u^2 + v^2)
        - end
        - 
        - # using BenchmarkTools
        - # @btime mainFlow(50)
        - mainFlow(50)

Profile.@profile

julia> Profile.print()
3081 .\task.jl:333; (::REPL.var"#26#27"{REPL.REPLBackend})()
 3081 ...re\julia\stdlib\v1.3\REPL\src\REPL.jl:118; macro expansion
  3081 ...re\julia\stdlib\v1.3\REPL\src\REPL.jl:86; eval_user_input(::Any, ::REPL.REPLBackend)
   3081 .\boot.jl:330; eval(::Module, ::Any)
    7   ...p\proj\Julia\JuliaCN\2763-mem\2763.jl:15; mainFlow(::Int64)
    970 ...p\proj\Julia\JuliaCN\2763-mem\2763.jl:41; mainFlow(::Int64)
     1   ...\proj\Julia\JuliaCN\2763-mem\2763.jl:0; collision(::Int64, ::Int64, ::Int64, ::Array{Float64,2},...
     39  ...\proj\Julia\JuliaCN\2763-mem\2763.jl:65; collision(::Int64, ::Int64, ::Int64, ::Array{Float64,2},...
      4  .\array.jl:0; getindex
      19 .\array.jl:745; getindex
      3  .\float.jl:405; *
      13 .\float.jl:401; +
     1   ...\proj\Julia\JuliaCN\2763-mem\2763.jl:66; collision(::Int64, ::Int64, ::Int64, ::Array{Float64,2},...
     220 ...\proj\Julia\JuliaCN\2763-mem\2763.jl:67; collision(::Int64, ::Int64, ::Int64, ::Array{Float64,2},...
      5   .\array.jl:744; getindex
      103 .\array.jl:745; getindex
      86  .\float.jl:405; *
      26  .\float.jl:401; +
     407 ...\proj\Julia\JuliaCN\2763-mem\2763.jl:68; collision(::Int64, ::Int64, ::Int64, ::Array{Float64,2},...
      30 .\array.jl:744; getindex
      57 .\array.jl:745; getindex
      62 .\array.jl:784; setindex!
      74 .\float.jl:405; *
      45 .\float.jl:403; -
      5  .\intfuncs.jl:244; literal_pow
       5 .\float.jl:405; *
      78 .\operators.jl:529; *
       78 .\float.jl:405; *
      56 .\operators.jl:529; +
       56 .\float.jl:401; +
     302 ...\proj\Julia\JuliaCN\2763-mem\2763.jl:69; collision(::Int64, ::Int64, ::Int64, ::Array{Float64,2},...
      36  .\array.jl:745; getindex
      26  .\array.jl:784; setindex!
      173 .\float.jl:405; *
      64  .\float.jl:401; +
      3   .\range.jl:598; iterate
       3 .\int.jl:53; +
    775 ...p\proj\Julia\JuliaCN\2763-mem\2763.jl:42; mainFlow(::Int64)
     775 ...\proj\Julia\JuliaCN\2763-mem\2763.jl:80; streaming(::Int64, ::Array{Float64,2}, ::Array{Float64,2...
      589 .\abstractarraymath.jl:242; circshift(::SubArray{Float64,2,Array{Float64,3},Tuple{B...
       100 .\abstractarray.jl:625; similar
        100 .\abstractarray.jl:626; similar
         100 .\subarray.jl:67; similar
          100 .\array.jl:334; similar
           100 .\boot.jl:414; Array
            100 .\boot.jl:406; Array
       479 .\multidimensional.jl:961; circshift!(::Array{Float64,2}, ::SubArray{Float64,2,Ar...
        211 .\multidimensional.jl:990; _circshift!
         73  .\multidimensional.jl:990; _circshift!
          73 .\multidimensional.jl:995; _circshift!
           73 .\multidimensional.jl:912; copyto!(::Array{Float64,2}, ::CartesianIndices{2,Tup...
            73 .\multidimensional.jl:924; macro expansion
             73 .\cartesian.jl:64; macro expansion
              2  .\array.jl:784; macro expansion
              71 .\multidimensional.jl:925; macro expansion
               1  .\array.jl:784; setindex!
               70 .\subarray.jl:230; getindex
                70 .\array.jl:745; getindex
         138 .\multidimensional.jl:991; _circshift!
          138 .\multidimensional.jl:995; _circshift!
           138 .\multidimensional.jl:912; copyto!(::Array{Float64,2}, ::CartesianIndices{2,Tup...
            138 .\multidimensional.jl:924; macro expansion
             138 .\cartesian.jl:64; macro expansion
              4   .\array.jl:784; macro expansion
              134 .\multidimensional.jl:925; macro expansion
               5   .\array.jl:784; setindex!
               129 .\subarray.jl:230; getindex
                129 .\array.jl:745; getindex
        268 .\multidimensional.jl:991; _circshift!
         92  .\multidimensional.jl:990; _circshift!
          92 .\multidimensional.jl:995; _circshift!
           92 .\multidimensional.jl:912; copyto!(::Array{Float64,2}, ::CartesianIndices{2,Tup...
            92 .\multidimensional.jl:924; macro expansion
             92 .\cartesian.jl:64; macro expansion
              92 .\multidimensional.jl:925; macro expansion
               23 .\array.jl:784; setindex!
               69 .\subarray.jl:230; getindex
                69 .\array.jl:745; getindex
         176 .\multidimensional.jl:991; _circshift!
          176 .\multidimensional.jl:995; _circshift!
           176 .\multidimensional.jl:912; copyto!(::Array{Float64,2}, ::CartesianIndices{2,Tup...
            176 .\multidimensional.jl:924; macro expansion
             176 .\cartesian.jl:64; macro expansion
              176 .\multidimensional.jl:925; macro expansion
               54  .\array.jl:784; setindex!
               122 .\subarray.jl:230; getindex
                122 .\array.jl:745; getindex
       2   .\tuple.jl:140; map(::Type, ::Tuple{Float64,Float64})
        1 .\float.jl:0; Integer(::Float64)
      186 .\broadcast.jl:822; materialize!(::SubArray{Float64,2,Array{Float64,3},Tupl...
       186 .\broadcast.jl:863; copyto!
        186 .\broadcast.jl:904; copyto!
         185 .\multidimensional.jl:905; copyto!(::SubArray{Float64,2,Array{Float64,3},Tuple{B...
          172 .\array.jl:744; getindex
          13  .\subarray.jl:283; setindex!
           13 .\array.jl:782; setindex!
    761 ...p\proj\Julia\JuliaCN\2763-mem\2763.jl:44; mainFlow(::Int64)
     49  .\reducedim.jl:92; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      39 .\array.jl:310; fill!(::Array{Float64,3}, ::Float64)
      10 .\array.jl:311; fill!(::Array{Float64,3}, ::Float64)
       10 .\array.jl:782; setindex!
     5   ...\proj\Julia\JuliaCN\2763-mem\2763.jl:0; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
     200 ...\proj\Julia\JuliaCN\2763-mem\2763.jl:127; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      200 .\none:0; #sum
       200 .\reducedim.jl:652; #sum#587
        200 .\reducedim.jl:678; _sum
         200 .\reducedim.jl:679; _sum
          200 .\none:0; #mapreduce
           200 .\reducedim.jl:307; #mapreduce#584
            187 .\reducedim.jl:274; _mapreduce_dim
             187 .\reducedim.jl:266; _mapreducedim!(::typeof(identity), ::typeof(Base.ad...
              87  .\array.jl:745; macro expansion
              100 .\simdloop.jl:77; macro expansion
               97 .\float.jl:401; macro expansion
               3  .\reducedim.jl:267; macro expansion
                3 .\multidimensional.jl:488; setindex!
                 3 .\array.jl:784; setindex!
            13  .\reducedim.jl:317; _mapreduce_dim
             13 .\reducedim.jl:172; reducedim_init
              13 .\reducedim.jl:93; reducedim_initarray
               13 .\boot.jl:408; reducedim_initarray
     63  ...\proj\Julia\JuliaCN\2763-mem\2763.jl:128; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      63 .\array.jl:325; copy
     60  ...\proj\Julia\JuliaCN\2763-mem\2763.jl:129; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      60 .\array.jl:325; copy
     1   ...\proj\Julia\JuliaCN\2763-mem\2763.jl:131; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
     2   ...\proj\Julia\JuliaCN\2763-mem\2763.jl:134; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
     183 ...\proj\Julia\JuliaCN\2763-mem\2763.jl:135; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      7   .\array.jl:0; getindex
      45  .\array.jl:745; getindex
      115 .\float.jl:405; *
      16  .\float.jl:401; +
     9   ...\proj\Julia\JuliaCN\2763-mem\2763.jl:136; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      9 .\array.jl:744; getindex
     48  ...\proj\Julia\JuliaCN\2763-mem\2763.jl:138; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      10 .\array.jl:745; getindex
      8  .\array.jl:784; setindex!
      30 .\float.jl:407; /
     68  ...\proj\Julia\JuliaCN\2763-mem\2763.jl:139; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      7  .\array.jl:784; setindex!
      61 .\float.jl:407; /
     73  ...\proj\Julia\JuliaCN\2763-mem\2763.jl:142; macroQuantity(::Int64, ::Int64, ::Int64, ::Array{Float6...
      12 .\boot.jl:406; materialize
      61 .\broadcast.jl:819; materialize
       61 .\broadcast.jl:839; copy
        61 .\broadcast.jl:863; copyto!
         61 .\broadcast.jl:908; copyto!
          61 .\simdloop.jl:77; macro expansion
           61 .\broadcast.jl:909; macro expansion
            56 .\broadcast.jl:563; getindex
             9  .\broadcast.jl:602; _broadcast_getindex
    341 ...p\proj\Julia\JuliaCN\2763-mem\2763.jl:46; mainFlow(::Int64)
     77  .\arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
      77 .\broadcast.jl:808; broadcast_preserving_zero_d
       15 .\boot.jl:406; materialize
       62 .\broadcast.jl:819; materialize
        62 .\broadcast.jl:839; copy
         62 .\broadcast.jl:863; copyto!
          62 .\broadcast.jl:908; copyto!
           62 .\simdloop.jl:77; macro expansion
            62 .\broadcast.jl:909; macro expansion
             61 .\broadcast.jl:563; getindex
              7  .\broadcast.jl:602; _broadcast_getindex
              54 .\broadcast.jl:603; _broadcast_getindex
               54 .\broadcast.jl:630; _broadcast_getindex_evalf
                54 .\float.jl:401; +
     135 .\arraymath.jl:39; -(::Array{Float64,2}, ::Array{Float64,2})
      135 .\broadcast.jl:808; broadcast_preserving_zero_d
       21  .\boot.jl:406; materialize
       114 .\broadcast.jl:819; materialize
        114 .\broadcast.jl:839; copy
         114 .\broadcast.jl:863; copyto!
          114 .\broadcast.jl:908; copyto!
           114 .\simdloop.jl:77; macro expansion
            4   .\array.jl:784; macro expansion
            110 .\broadcast.jl:909; macro expansion
             110 .\broadcast.jl:563; getindex
              18 .\broadcast.jl:602; _broadcast_getindex
               18 .\broadcast.jl:626; _getindex
                18 .\broadcast.jl:596; _broadcast_getindex
                 18 .\multidimensional.jl:486; getindex
                  18 .\array.jl:745; getindex
              92 .\float.jl:403; _broadcast_getindex
     124 .\broadcast.jl:819; materialize
      124 .\broadcast.jl:839; copy
       102 .\broadcast.jl:863; copyto!
        1   .\broadcast.jl:907; copyto!
        101 .\broadcast.jl:908; copyto!
         101 .\simdloop.jl:77; macro expansion
          4  .\array.jl:784; macro expansion
          97 .\broadcast.jl:909; macro expansion
           97 .\broadcast.jl:563; getindex
            25 .\broadcast.jl:602; _broadcast_getindex
             25 .\broadcast.jl:626; _getindex
              25 .\broadcast.jl:626; _getindex
               25 .\broadcast.jl:596; _broadcast_getindex
                25 .\multidimensional.jl:486; getindex
                 23 .\array.jl:745; getindex
                 2  .\int.jl:53; getindex
            36 .\broadcast.jl:603; _broadcast_getindex
             36 .\broadcast.jl:630; _broadcast_getindex_evalf
              36 .\intfuncs.jl:244; literal_pow
               36 .\float.jl:405; *
            36 .\float.jl:405; _broadcast_getindex
       22  .\broadcast.jl:196; similar
        22 .\abstractarray.jl:671; similar
         22 .\abstractarray.jl:672; similar
          22 .\boot.jl:421; Array
           22 .\boot.jl:414; Array
            22 .\boot.jl:406; Array
    200 ...p\proj\Julia\JuliaCN\2763-mem\2763.jl:47; mainFlow(::Int64)
     72  .\arraymath.jl:47; +(::Array{Float64,2}, ::Array{Float64,2})
      72 .\broadcast.jl:808; broadcast_preserving_zero_d
       11 .\boot.jl:406; materialize
       61 .\broadcast.jl:819; materialize
        61 .\broadcast.jl:839; copy
         61 .\broadcast.jl:863; copyto!
          61 .\broadcast.jl:908; copyto!
           61 .\simdloop.jl:77; macro expansion
            61 .\broadcast.jl:909; macro expansion
             56 .\broadcast.jl:563; getindex
              6  .\broadcast.jl:602; _broadcast_getindex
              50 .\broadcast.jl:603; _broadcast_getindex
               50 .\broadcast.jl:630; _broadcast_getindex_evalf
                50 .\float.jl:401; +
     118 .\broadcast.jl:819; materialize
      118 .\broadcast.jl:839; copy
       106 .\broadcast.jl:863; copyto!
        106 .\broadcast.jl:908; copyto!
         106 .\simdloop.jl:77; macro expansion
          5   .\array.jl:784; macro expansion
          101 .\broadcast.jl:909; macro expansion
           99 .\broadcast.jl:563; getindex
            34 .\broadcast.jl:602; _broadcast_getindex
            33 .\broadcast.jl:603; _broadcast_getindex
            32 .\float.jl:405; _broadcast_getindex
       12  .\broadcast.jl:196; similar
     10  .\reducedim.jl:652; sum

julia>

circshift 问题很大

看起来还有很多地方分配了内存，可以试试上循环。

tyfdzt · 2019 年12 月 22 日 14:27

感谢您的意见，因为从matlab转过来不久，所以还带着很多matlab的书写习惯，用ProfileView.jl以后再改改看

azurefx · 2019 年12 月 22 日 15:05

内存分配失败不会导致 NaN，仅仅会让你的程序挂掉…（笑）
通常NaN来自于浮点数异常，比如迭代着数值飞了
另外 Julia 的垃圾回收时机是下一次内存分配时。如果总内存占用不大，但 allocations 很多，说明你有很多不必要的复制，或者类型不稳定导致的动态化开销。性能优化楼上说得差不多了，多看看官网 Performance Tips 就好，重点关注数组的不同操作的语义。

woclass · 2019 年12 月 23 日 07:45

依据前文给出的 .mem 内存分配计数，进一步优化，尽量减少循环中分配内存。

collision 函数里面没分配过内存。尽量做到这样。
streaming 是内存分配的大头，主要问题是 circshift。有点复杂暂时没思路。
boundaryCondition

macroQuantity 使用 inplace 操作避免内存分配。
主要问题在以下几句

# mem allo
288016000 ρ = sum(f, dims = 3)
288004000 u0 = copy(u)
288004000 v0 = copy(v)

288004000 uv = @. sqrt(u^2 + v^2)

基准 “11.641 s (10202 allocations: 5.77 GiB)” （不知道为啥变慢了

inplace 操作

`macroQuantity()`

前面提到过 ρ 类型不稳定，但一直不知道为什么。发现是 sum 的锅。
ρ = sum(f, dims = 3)；参数类型 f::Array{Float64,3}, ρ::Array{Float64,2}
sum(f, dims = 3) 会返回一个 10×10×1 Array{Int64,3}: 导致 ρ 类型不稳定

类型不稳定举例

f = rand(Int, 10,10,8);
r1 = sum(f, dims = 3);
r2 = rand(Int, 10,10);

typeof(r1)  # Array{Int64,3}
typeof(r2)  # Array{Int64,2}
typeof(r1[:,:,1])   # Array{Int64,2}
r1[:,:,1] == r2     # true

sum => sum!
copy => copy!
uv = @. sqrt(u^2 + v^2) => @. uv = sqrt(u^2 + v^2)

diff

# ρ = sum(f, dims = 3)
# u0 = copy(u)
# v0 = copy(v)
sum!(ρ, f)
copy!(u0, u)
copy!(v0, v)

# uv = @. sqrt(u^2 + v^2)
@. uv = sqrt(u^2 + v^2)
# @. uv[:] = sqrt(u[:]^2 + v[:]^2) #  9853 allocations: 5.23 GiB

“15.818 s (9449 allocations: 4.70 GiB)”

测了下新的内存分配,macroQuantity() 已经不会分配内存了。

woclass · 2019 年12 月 23 日 08:18

基准 “10.228 s (9402 allocations: 4.70 GiB)”

自己给自己

boundaryCondition()

@. @views 的组合测试

view.jl


const nx = 1200
const ny = 600
const Q = 8
f = rand(Int, nx, ny, Q);

function r0!(f) # original
    @views f[:, ny, 4] = f[:, ny, 2]
    @views f[:, ny, 8] = f[:, ny, 6]
    @views f[:, ny, 7] = f[:, ny, 5]
end # 7.767 μs (9 allocations: 28.88 KiB)

function r1!(f) # nothing
    f[:, ny, 4] = f[:, ny, 2]
    f[:, ny, 8] = f[:, ny, 6]
    f[:, ny, 7] = f[:, ny, 5]
end # 6.140 μs (3 allocations: 28.50 KiB)

function r2!(f) # dot macro `@.` only
    @. f[:, ny, 4] = f[:, ny, 2]
    @. f[:, ny, 8] = f[:, ny, 6]
    @. f[:, ny, 7] = f[:, ny, 5]
end # 3.600 μs (6 allocations: 28.69 KiB)

function r3!(f) # dot only
    f[:, ny, 4] .= f[:, ny, 2]
    f[:, ny, 8] .= f[:, ny, 6]
    f[:, ny, 7] .= f[:, ny, 5]
end # 3.686 μs (6 allocations: 28.69 KiB)

# @. @views
function r4!(f) # @. @views
    @. @views f[:, ny, 4] = f[:, ny, 2]
    @. @views f[:, ny, 8] = f[:, ny, 6]
    @. @views f[:, ny, 7] = f[:, ny, 5]
end # 1.230 μs (6 allocations: 384 bytes)

function r5!(f) # @views @.
    @views @. f[:, ny, 4] = f[:, ny, 2]
    @views @. f[:, ny, 8] = f[:, ny, 6]
    @views @. f[:, ny, 7] = f[:, ny, 5]
end # 1.210 μs (6 allocations: 384 bytes)

function r6!(f) # add .
    @views f[:, ny, 4] .= f[:, ny, 2]
    @views f[:, ny, 8] .= f[:, ny, 6]
    @views f[:, ny, 7] .= f[:, ny, 5]
end # 1.230 μs (6 allocations: 384 bytes)

function r7!(f) # add .
    copy!(f[:, ny, 4], f[:, ny, 2])
    copy!(f[:, ny, 8], f[:, ny, 6])
    copy!(f[:, ny, 7], f[:, ny, 5])
end # 6.750 μs (6 allocations: 57.00 KiB)

# julia> @btime r0!($f);
#   7.767 μs (9 allocations: 28.88 KiB)
# julia> @btime r1!($f);
#   6.140 μs (3 allocations: 28.50 KiB)
# julia> @btime r2!($f);
#   3.600 μs (6 allocations: 28.69 KiB)
# julia> @btime r3!($f);
#   3.686 μs (6 allocations: 28.69 KiB)
# julia> @btime r4!($f);
#   1.230 μs (6 allocations: 384 bytes)
# julia> @btime r5!($f);
#   1.210 μs (6 allocations: 384 bytes)
# julia> @btime r6!($f);
#   1.230 μs (6 allocations: 384 bytes)
# julia> @btime r7!($f);
#   6.750 μs (6 allocations: 57.00 KiB)

diff

#left
# @views f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] .* U * 2.0 / 3.0
# @views f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
# @views f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
@views @. f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] * U * 2.0 / 3.0
@views @. f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0
@views @. f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0

#right
# @views f[nx, :, 3] = f[nx-1, :, 3]
# @views f[nx, :, 6] = f[nx-1, :, 6]
# @views f[nx, :, 7] = f[nx-1, :, 7]
@views @. f[nx, :, 3] = f[nx-1, :, 3]
@views @. f[nx, :, 6] = f[nx-1, :, 6]
@views @. f[nx, :, 7] = f[nx-1, :, 7]

#bottom
# @views f[:, 1, 2] = f[:, 1, 4]
# @views f[:, 1, 5] = f[:, 1, 7]
# @views f[:, 1, 6] = f[:, 1, 8]
@views @. f[:, 1, 2] = f[:, 1, 4]
@views @. f[:, 1, 5] = f[:, 1, 7]
@views @. f[:, 1, 6] = f[:, 1, 8]

#top
# @views f[:, ny, 4] = f[:, ny, 2]
# @views f[:, ny, 8] = f[:, ny, 6]
# @views f[:, ny, 7] = f[:, ny, 5]
@views @. f[:, ny, 4] = f[:, ny, 2]
@views @. f[:, ny, 8] = f[:, ny, 6]
@views @. f[:, ny, 7] = f[:, ny, 5]

mem

        - # function boundaryCondition(nx, ny, u, v, ρ, U, f)
        - function boundaryCondition(nx::Int64, ny::Int64, 
        -     u::Array{Float64,2}, v::Array{Float64,2}, ρ::Array{Float64,2}, U::Float64, f::Array{Float64,3}
        - )
        - #left
        - # @views f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] .* U * 2.0 / 3.0
        - # @views f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
        - # @views f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
     9600 @views @. f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] * U * 2.0 / 3.0
    16000 @views @. f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0
    16000 @views @. f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0
        0 @views u[1, 2:ny-1] .= U
        0 @views v[1, 2:ny-1] .= 0.0
        - 
        - #right
        - # @views f[nx, :, 3] = f[nx-1, :, 3]
        - # @views f[nx, :, 6] = f[nx-1, :, 6]
        - # @views f[nx, :, 7] = f[nx-1, :, 7]
     6400 @views @. f[nx, :, 3] = f[nx-1, :, 3]
     6400 @views @. f[nx, :, 6] = f[nx-1, :, 6]
     6400 @views @. f[nx, :, 7] = f[nx-1, :, 7]
        - 
        - #bottom
        - # @views f[:, 1, 2] = f[:, 1, 4]
        - # @views f[:, 1, 5] = f[:, 1, 7]
        - # @views f[:, 1, 6] = f[:, 1, 8]
     6400 @views @. f[:, 1, 2] = f[:, 1, 4]
     6400 @views @. f[:, 1, 5] = f[:, 1, 7]
     6400 @views @. f[:, 1, 6] = f[:, 1, 8]
        0 @views u[:, 1] .= 0.0
        0 @views v[:, 1] .= 0.0
        - 
        - #top
        - # @views f[:, ny, 4] = f[:, ny, 2]
        - # @views f[:, ny, 8] = f[:, ny, 6]
        - # @views f[:, ny, 7] = f[:, ny, 5]
     6400 @views @. f[:, ny, 4] = f[:, ny, 2]
     6400 @views @. f[:, ny, 8] = f[:, ny, 6]
     6400 @views @. f[:, ny, 7] = f[:, ny, 5]
        0 @views u[:, ny] .= 0.0
     2400 @views v[:, ny] .= 0.0
        - end

减少了内存分配，不过这里本来就不是瓶颈
“8.972 s (8302 allocations: 4.69 GiB)”

tyfdzt · 2019 年12 月 23 日 08:35

真的十分感谢您能一直对这个问题保持关注，并且给予十分有效的改进意见。这个格子玻尔兹曼是cfd模拟的一个工具，circshift是matlab的经典写法，也有用C语言实现的同一问题的代码，里面没有circshift函数，但能实现了相同的计算结果。我会尽快把改过的代码放上来，请您继续关注。感谢！

woclass · 2019 年12 月 23 日 08:55

将 mainFlow 里计算相对误差的部分~~拿出来成为一个函数~~就地改写。

第一个 temp1 分配的内存是程序里动态分配第二多的。（~~早干嘛去了~~应该先优化这个的
mem

        0 for iterStep = 1:iterStepMax
        0     collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        0     streaming(Q, cx, cy, f)
        0     boundaryCondition(nx, ny, u, v, ρ, U, f)
        0     macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        -     # temp1 = sum((u[:, :] - u0[:, :]) .^ 2 + (v[:, :] - v0[:, :]) .^ 2)
1440020000     temp1 = sum((u - u0) .^ 2 + (v - v0) .^ 2)
864012000     temp2 = sum(u .^ 2 + v .^ 2)
        0     err = sqrt(temp1) / sqrt(temp2 + 1e-30)
        0     if err <= ε
        0         break
        -     end
        - end



        0 for k = 1:Q
2617296789     @views f[:, :, k] .= circshift(f[:, :, k], [cx[k], cy[k]])
        - end

尽量少分配内存

diff

# 在 mainFlow 内，第一个循环外预分配内存
rel_err_temp = ones(Float64, nx, ny)

# for iterStep = 1:iterStepMax 循环内
    # temp1 = sum((u - u0) .^ 2 + (v - v0) .^ 2)
    # temp2 = sum(u .^ 2 + v .^ 2)
    # err = sqrt(temp1) / sqrt(temp2 + 1e-30)
    @. rel_err_temp = (u - u0)^2 + (v - v0)^2
    temp1 = sum(rel_err_temp)
    @. rel_err_temp = u^2 + v^2
    temp2 = sum(rel_err_temp)
    
    err = sqrt(temp1 / (temp2 + 1e-30))

基准（新增变量 rel_err_temp 后）“7.698 s (8305 allocations: 4.69 GiB)”
“7.671 s (7504 allocations: 2.55 GiB)”

mem

        - function mainFlow(iterStepMax::Int64)
        0 Q = 9 #D2Q9
        - nx = 1200 #lattice number in horizontal direction
        - ny = 600 #lattice number in vertical direction
        - U = 0.1 #inlet velocity
      160 cx = [1.0 0.0 -1.0 0.0 1.0 -1.0 -1.0 1.0 0.0] #velocity components in horizontal direction
      160 cy = [0.0 1.0 0.0 -1.0 1.0 1.0 -1.0 -1.0 0.0] #velocity components in vertical direction
      160 w = [1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 4.0 / 9.0] #weight parameters
        - # cx = [1.0, 0.0, -1.0, 0.0, 1.0, -1.0, -1.0, 1.0, 0.0] #velocity components in horizontal direction
        - # cy = [0.0, 1.0, 0.0, -1.0, 1.0, 1.0, -1.0, -1.0, 0.0] #velocity components in vertical direction
        - # w = [1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 4.0 / 9.0] #weight parameters
        - 
  5760080 rel_err_temp = Array{Float64}(undef, nx, ny)
  5760080 ρ = ones(Float64, nx, ny) #initial density
  5760080 u = zeros(Float64, nx, ny) #horizontal velocity component
  5760080 v = zeros(Float64, nx, ny) #vertical velocity component
  5760080 u0 = zeros(Float64, nx, ny) #horizontal velocity component before iteration
  5760080 v0 = zeros(Float64, nx, ny) #vertical velocity component befre iteration
        - # u0 = Array{Float64}(undef, nx, ny)
        - # v0 = Array{Float64}(undef, nx, ny)
  5760080 uv = zeros(Float64, nx, ny) #resultant velocity
 51840080 f = zeros(Float64, nx, ny, Q) #distribution function
 51840080 feq = zeros(Float64, nx, ny, Q) #equilibrium distribution function
        - 
        - dx = 1.0 #horizontal lattice length
        - dy = 1.0 #vertical lattice length
        - lx = dx * nx #horizontal domain length
        - ly = dy * ny #vertical domain length
        - dt = dx #dt = 1.0
        - c = dx / dt #lattice sound speed
        - Re = 1000.0 #Reynoldz number
        - ν = U * ly / Re #kinematic viscousity
        - τ = 3.0ν + 0.5 #relaxation time
        - ω = 1.0 / τ #relaxation frequency
  4792960 println("Re = ", Re, ", τ = ", τ) #instead of "Re = $Re, τ = $τ"
        - # iterStepMax = 10000
        - ε = 1.0e-6 #convergence requirement
        - 
        0 @views u[1, 2:ny-1] .= U
        0 for iterStep = 1:iterStepMax
        0     collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        0     streaming(Q, cx, cy, f)
        0     boundaryCondition(nx, ny, u, v, ρ, U, f)
        0     macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        -     
        -     # temp1 = sum((u[:, :] - u0[:, :]) .^ 2 + (v[:, :] - v0[:, :]) .^ 2)
        -     # temp1 = sum((u - u0) .^ 2 + (v - v0) .^ 2)
        -     # temp2 = sum(u .^ 2 + v .^ 2)
        -     # err = sqrt(temp1) / sqrt(temp2 + 1e-30)
        0     @. rel_err_temp = (u - u0)^2 + (v - v0)^2
        0     temp1 = sum(rel_err_temp)
        0     @. rel_err_temp = u^2 + v^2
        0     temp2 = sum(rel_err_temp)
        -     
        0     err = sqrt(temp1 / (temp2 + 1e-30))
        0     if err <= ε
        0         break
        -     end
        - end
        0     nx, ny, u, v
        - end

woclass · 2019 年12 月 23 日 09:00

我是觉得有个实际的例子做做优化也挺有意思的，我也能学到不少知识。

应该上来就用自带的 julia --track-allocation=user filename.jl 看一下内存分配的。然后从大头开始。

woclass · 2019 年12 月 23 日 09:16

给一版优化后的完整代码

~~类型稳定和内存分配问题最大的就剩下 circshift 了。~~

circshift 类型不稳定及内存分配问题的解决

类型不稳定是因为你给的 cx, cy 是用来当作索引值的应该是个 Tuple{Int64,Int64}，然后代码里给成了 Array{Float64,1}

更正之后快了一点

7.060 s (7503 allocations: 2.55 GiB)
6.694 s (3507 allocations: 2.55 GiB)

# cx = [1.0 0.0 -1.0 0.0 1.0 -1.0 -1.0 1.0 0.0] #velocity components in horizontal direction
# cy = [0.0 1.0 0.0 -1.0 1.0 1.0 -1.0 -1.0 0.0] #velocity components in vertical direction
cx = [1 0 -1  0 1 -1 -1  1 0] #velocity components in horizontal direction
cy = [0 1  0 -1 1  1 -1 -1 0] #velocity components in vertical direction


    # @views f[:, :, k] .= circshift(f[:, :, k], [cx[k], cy[k]])
    @views f[:, :, k] .= circshift(f[:, :, k], (cx[k], cy[k]))

julia 的 circshift 实现有点问题 #25003，
换用 piever/ShiftedArrays.jl: Lazy shifted arrays for data analysis in Julia

9.204 s (10858 allocations: 7.38 GiB)
~~7.060 s (7503 allocations: 2.55 GiB)~~ 修复类型不稳定
~~6.694 s (3507 allocations: 2.55 GiB)~~ 换用其他 circshift 实现
5.533 s (3003 allocations: 137.50 MiB)

julia> @benchmark mainFlow(50)
BenchmarkTools.Trial:
  memory estimate:  137.50 MiB
  allocs estimate:  3051
  --------------
  minimum time:     5.292 s (0.02% GC)
  median time:      5.292 s (0.02% GC)
  mean time:        5.292 s (0.02% GC)
  maximum time:     5.292 s (0.02% GC)
  --------------
  samples:          1
  evals/sample:     1

优化后的代码：

2763v3.0.jl

using ShiftedArrays

function mainFlow(iterStepMax::Int64)
Q = 9 #D2Q9
nx = 1200 #lattice number in horizontal direction
ny = 600 #lattice number in vertical direction
U = 0.1 #inlet velocity
# cx = [1.0 0.0 -1.0 0.0 1.0 -1.0 -1.0 1.0 0.0] #velocity components in horizontal direction
# cy = [0.0 1.0 0.0 -1.0 1.0 1.0 -1.0 -1.0 0.0] #velocity components in vertical direction
cx = [1 0 -1  0 1 -1 -1  1 0] #velocity components in horizontal direction
cy = [0 1  0 -1 1  1 -1 -1 0] #velocity components in vertical direction

w = [1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 4.0 / 9.0] #weight parameters

# rel_err_temp = Array{Float64}(undef, nx, ny)
rel_err_temp = ones(Float64, nx, ny)
ρ = ones(Float64, nx, ny) #initial density
u = zeros(Float64, nx, ny) #horizontal velocity component
v = zeros(Float64, nx, ny) #vertical velocity component
u0 = zeros(Float64, nx, ny) #horizontal velocity component before iteration
v0 = zeros(Float64, nx, ny) #vertical velocity component befre iteration
# u0 = Array{Float64}(undef, nx, ny)
# v0 = Array{Float64}(undef, nx, ny)
uv = zeros(Float64, nx, ny) #resultant velocity
f = zeros(Float64, nx, ny, Q) #distribution function
feq = zeros(Float64, nx, ny, Q) #equilibrium distribution function

dx = 1.0 #horizontal lattice length
dy = 1.0 #vertical lattice length
lx = dx * nx #horizontal domain length
ly = dy * ny #vertical domain length
dt = dx #dt = 1.0
c = dx / dt #lattice sound speed
Re = 1000.0 #Reynoldz number
ν = U * ly / Re #kinematic viscousity
τ = 3.0ν + 0.5 #relaxation time
ω = 1.0 / τ #relaxation frequency
println("Re = ", Re, ", τ = ", τ) #instead of "Re = $Re, τ = $τ"
# iterStepMax = 10000
ε = 1.0e-6 #convergence requirement

@views u[1, 2:ny-1] .= U
for iterStep = 1:iterStepMax
    collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
    streaming(Q, cx, cy, f)
    boundaryCondition(nx, ny, u, v, ρ, U, f)
    macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
    
    # temp1 = sum((u[:, :] - u0[:, :]) .^ 2 + (v[:, :] - v0[:, :]) .^ 2)
    # temp2 = sum(u .^ 2 + v .^ 2)
    # err = sqrt(temp1) / sqrt(temp2 + 1e-30)
    @. rel_err_temp = (u - u0)^2 + (v - v0)^2
    temp1 = sum(rel_err_temp)
    @. rel_err_temp = u^2 + v^2
    temp2 = sum(rel_err_temp)
    
    err = sqrt(temp1 / (temp2 + 1e-30))
    if err <= ε
        break
    end
end
end

function collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
for j = 1:ny, i = 1:nx
    t1 = u[i, j] * u[i, j] + v[i, j] * v[i, j]
    for k = 1:Q
        t2 = u[i, j] * cx[k] + v[i, j] * cy[k]
        feq[i, j, k] = ρ[i, j] * w[k] * (1.0 + 3.0t2 + 4.5 * t2^2 - 1.5t1)
        f[i, j, k] = (1.0 - ω) * f[i, j, k] + ω * feq[i, j, k]
    end
end
end

function streaming(Q, cx, cy, f)
for k = 1:Q
    # @views f[:, :, k] .= circshift(f[:, :, k], [cx[k], cy[k]])
    # @views f[:, :, k] .= circshift(f[:, :, k], (cx[k], cy[k]))
    @views f[:, :, k] .= ShiftedArrays.circshift(f[:, :, k], (cx[k], cy[k]))
end
end

function boundaryCondition(nx, ny, u, v, ρ, U, f)
#left
# @views f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] .* U * 2.0 / 3.0
# @views f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
# @views f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
@views @. f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] * U * 2.0 / 3.0
@views @. f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0
@views @. f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0
@views u[1, 2:ny-1] .= U
@views v[1, 2:ny-1] .= 0.0

#right
# @views f[nx, :, 3] = f[nx-1, :, 3]
# @views f[nx, :, 6] = f[nx-1, :, 6]
# @views f[nx, :, 7] = f[nx-1, :, 7]
@views @. f[nx, :, 3] = f[nx-1, :, 3]
@views @. f[nx, :, 6] = f[nx-1, :, 6]
@views @. f[nx, :, 7] = f[nx-1, :, 7]

#bottom
# @views f[:, 1, 2] = f[:, 1, 4]
# @views f[:, 1, 5] = f[:, 1, 7]
# @views f[:, 1, 6] = f[:, 1, 8]
@views @. f[:, 1, 2] = f[:, 1, 4]
@views @. f[:, 1, 5] = f[:, 1, 7]
@views @. f[:, 1, 6] = f[:, 1, 8]
@views u[:, 1] .= 0.0
@views v[:, 1] .= 0.0

#top
# @views f[:, ny, 4] = f[:, ny, 2]
# @views f[:, ny, 8] = f[:, ny, 6]
# @views f[:, ny, 7] = f[:, ny, 5]
@views @. f[:, ny, 4] = f[:, ny, 2]
@views @. f[:, ny, 8] = f[:, ny, 6]
@views @. f[:, ny, 7] = f[:, ny, 5]
@views u[:, ny] .= 0.0
@views v[:, ny] .= 0.0
end


function macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
# ρ[:, :] = sum(f, dims = 3)
# u0[:, :] = u[:, :]
# v0[:, :] = v[:, :]
sum!(ρ, f)
copy!(u0, u)
copy!(v0, v)

for j = 1:ny, i = 1:nx
    uSum = 0.0
    vSum = 0.0
    for k = 1:Q
        uSum += f[i, j, k] * cx[k]
        vSum += f[i, j, k] * cy[k]
    end
    u[i, j] = uSum / ρ[i, j]
    v[i, j] = vSum / ρ[i, j]
end
# uv[:, :] = @. sqrt(u[:, :]^2 + v[:, :]^2)
@. uv = sqrt(u^2 + v^2)
end

# mainFlow(50)
using BenchmarkTools
@btime mainFlow(50)

内存分配计数：--track-allocation=user 输出见

.mem

        - using ShiftedArrays
        - 
        - function mainFlow(iterStepMax::Int64)
444211806 Q = 9 #D2Q9
        - nx = 1200 #lattice number in horizontal direction
        - ny = 600 #lattice number in vertical direction
        - U = 0.1 #inlet velocity
        - # cx = [1.0 0.0 -1.0 0.0 1.0 -1.0 -1.0 1.0 0.0] #velocity components in horizontal direction
        - # cy = [0.0 1.0 0.0 -1.0 1.0 1.0 -1.0 -1.0 0.0] #velocity components in vertical direction
      160 cx = [1 0 -1  0 1 -1 -1  1 0] #velocity components in horizontal direction
      160 cy = [0 1  0 -1 1  1 -1 -1 0] #velocity components in vertical direction
        - 
      160 w = [1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 4.0 / 9.0] #weight parameters
        - 
        - # rel_err_temp = Array{Float64}(undef, nx, ny)
  5760080 rel_err_temp = ones(Float64, nx, ny)
  5760080 ρ = ones(Float64, nx, ny) #initial density
  5760080 u = zeros(Float64, nx, ny) #horizontal velocity component
  5760080 v = zeros(Float64, nx, ny) #vertical velocity component
  5760080 u0 = zeros(Float64, nx, ny) #horizontal velocity component before iteration
  5760080 v0 = zeros(Float64, nx, ny) #vertical velocity component befre iteration
        - # u0 = Array{Float64}(undef, nx, ny)
        - # v0 = Array{Float64}(undef, nx, ny)
  5760080 uv = zeros(Float64, nx, ny) #resultant velocity
 51840080 f = zeros(Float64, nx, ny, Q) #distribution function
 51840080 feq = zeros(Float64, nx, ny, Q) #equilibrium distribution function
        - 
        - dx = 1.0 #horizontal lattice length
        - dy = 1.0 #vertical lattice length
        - lx = dx * nx #horizontal domain length
        - ly = dy * ny #vertical domain length
        - dt = dx #dt = 1.0
        - c = dx / dt #lattice sound speed
        - Re = 1000.0 #Reynoldz number
        - ν = U * ly / Re #kinematic viscousity
        - τ = 3.0ν + 0.5 #relaxation time
        - ω = 1.0 / τ #relaxation frequency
  4792848 println("Re = ", Re, ", τ = ", τ) #instead of "Re = $Re, τ = $τ"
        - # iterStepMax = 10000
        - ε = 1.0e-6 #convergence requirement
        - 
        0 @views u[1, 2:ny-1] .= U
        0 for iterStep = 1:iterStepMax
        0     collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        0     streaming(Q, cx, cy, f)
        0     boundaryCondition(nx, ny, u, v, ρ, U, f)
        0     macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        -     
        -     # temp1 = sum((u[:, :] - u0[:, :]) .^ 2 + (v[:, :] - v0[:, :]) .^ 2)
        -     # temp2 = sum(u .^ 2 + v .^ 2)
        -     # err = sqrt(temp1) / sqrt(temp2 + 1e-30)
        0     @. rel_err_temp = (u - u0)^2 + (v - v0)^2
        0     temp1 = sum(rel_err_temp)
        0     @. rel_err_temp = u^2 + v^2
        0     temp2 = sum(rel_err_temp)
        -     
        0     err = sqrt(temp1 / (temp2 + 1e-30))
        0     if err <= ε
        0         break
        -     end
        - end
        - 
        - end
        - 
        - function collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        0 for j = 1:ny, i = 1:nx
        0     t1 = u[i, j] * u[i, j] + v[i, j] * v[i, j]
        0     for k = 1:Q
        0         t2 = u[i, j] * cx[k] + v[i, j] * cy[k]
        0         feq[i, j, k] = ρ[i, j] * w[k] * (1.0 + 3.0t2 + 4.5 * t2^2 - 1.5t1)
        0         f[i, j, k] = (1.0 - ω) * f[i, j, k] + ω * feq[i, j, k]
        -     end
        - end
        - end
        - 
        - function streaming(Q, cx, cy, f)
        0 for k = 1:Q
        -     # @views f[:, :, k] .= circshift(f[:, :, k], [cx[k], cy[k]])
        -     # @views f[:, :, k] .= circshift(f[:, :, k], (cx[k], cy[k]))
    72000     @views f[:, :, k] .= ShiftedArrays.circshift(f[:, :, k], (cx[k], cy[k]))
        - end
        - end
        - 
        - function boundaryCondition(nx, ny, u, v, ρ, U, f)
        - #left
        - # @views f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] .* U * 2.0 / 3.0
        - # @views f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
        - # @views f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 .* (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] .* U / 6.0
     9600 @views @. f[1, 2:ny-1, 1] = f[1, 2:ny-1, 3] + ρ[1, 2:ny-1] * U * 2.0 / 3.0
    16000 @views @. f[1, 2:ny-1, 5] = f[1, 2:ny-1, 7] - 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0
    16000 @views @. f[1, 2:ny-1, 8] = f[1, 2:ny-1, 6] + 0.5 * (f[1, 2:ny-1, 2] - f[1, 2:ny-1, 4]) + ρ[1, 2:ny-1] * U / 6.0
        0 @views u[1, 2:ny-1] .= U
        0 @views v[1, 2:ny-1] .= 0.0
        - 
        - #right
        - # @views f[nx, :, 3] = f[nx-1, :, 3]
        - # @views f[nx, :, 6] = f[nx-1, :, 6]
        - # @views f[nx, :, 7] = f[nx-1, :, 7]
     6400 @views @. f[nx, :, 3] = f[nx-1, :, 3]
     6400 @views @. f[nx, :, 6] = f[nx-1, :, 6]
     6400 @views @. f[nx, :, 7] = f[nx-1, :, 7]
        - 
        - #bottom
        - # @views f[:, 1, 2] = f[:, 1, 4]
        - # @views f[:, 1, 5] = f[:, 1, 7]
        - # @views f[:, 1, 6] = f[:, 1, 8]
     6400 @views @. f[:, 1, 2] = f[:, 1, 4]
     6400 @views @. f[:, 1, 5] = f[:, 1, 7]
     6400 @views @. f[:, 1, 6] = f[:, 1, 8]
        0 @views u[:, 1] .= 0.0
        0 @views v[:, 1] .= 0.0
        - 
        - #top
        - # @views f[:, ny, 4] = f[:, ny, 2]
        - # @views f[:, ny, 8] = f[:, ny, 6]
        - # @views f[:, ny, 7] = f[:, ny, 5]
     6400 @views @. f[:, ny, 4] = f[:, ny, 2]
     6400 @views @. f[:, ny, 8] = f[:, ny, 6]
     6400 @views @. f[:, ny, 7] = f[:, ny, 5]
        0 @views u[:, ny] .= 0.0
     2400 @views v[:, ny] .= 0.0
        - end
        - 
        - 
        - function macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        - # ρ[:, :] = sum(f, dims = 3)
        - # u0[:, :] = u[:, :]
        - # v0[:, :] = v[:, :]
        0 sum!(ρ, f)
        0 copy!(u0, u)
        0 copy!(v0, v)
        - 
        0 for j = 1:ny, i = 1:nx
        -     uSum = 0.0
        -     vSum = 0.0
        0     for k = 1:Q
        0         uSum += f[i, j, k] * cx[k]
        0         vSum += f[i, j, k] * cy[k]
        -     end
        0     u[i, j] = uSum / ρ[i, j]
        0     v[i, j] = vSum / ρ[i, j]
        - end
        - # uv[:, :] = @. sqrt(u[:, :]^2 + v[:, :]^2)
        0 @. uv = sqrt(u^2 + v^2)
        - end
        - 
        - mainFlow(50)

tyfdzt · 2019 年12 月 23 日 14:17

我照着修改了，就像展示出来的一样，内存分配少了很多。但是出现了个问题，监测发现err的数值一直是不变的，也就是计算一直在重复第一步，经过加入一个定义的中间变量F(nx, ny, Q)就完全正常了。现在虽然内存还是有些微增长，但是和之前相比已经有了巨大的改观。

function streaming(Q, cx, cy, f, F)
    copy!(F, f)
    for k = 1:Q
        @views f[:, :, k] = ShiftedArrays.circshift(F[:, :, k], (cx[k], cy[k]))
    end
end

Roger · 2019 年12 月 23 日 16:21

View目前会在heap上分配，把这个在batch dimension上的view展开成loop会快一些。就能把这部分allocation去掉。

tyfdzt · 2019 年12 月 24 日 01:56

改写成loop以后，我稍微加了点输出到面板和画图的代码，如下

using ShiftedArrays, Plots

function mainFlow(iterStepMax::Int64)
    Q = 9 #D2Q9
    nx = 1200 #lattice number in horizontal direction
    ny = 600 #lattice number in vertical direction
    U = 0.1 #inlet velocity
    cx = [1 0 -1 0 1 -1 -1 1 0] #velocity components in horizontal direction
    cy = [0 1 0 -1 1 1 -1 -1 0] #velocity components in vertical direction
    w = [1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 4.0 / 9.0] #weight parameters
    rel_err_temp = ones(Float64, nx, ny)
    ρ = ones(Float64, nx, ny) #initial density
    u = zeros(Float64, nx, ny) #horizontal velocity component
    v = zeros(Float64, nx, ny) #vertical velocity component
    u0 = zeros(Float64, nx, ny) #horizontal velocity component before iteration
    v0 = zeros(Float64, nx, ny) #vertical velocity component befre iteration
    uv = zeros(Float64, nx, ny) #resultant velocity
    f = zeros(Float64, nx, ny, Q) #distribution function
    F = zeros(Float64, nx, ny, Q)
    feq = zeros(Float64, nx, ny, Q) #equilibrium distribution function
    dx = 1.0 #horizontal lattice length
    dy = 1.0 #vertical lattice length
    lx = dx * nx #horizontal domain length
    ly = dy * ny #vertical domain length
    dt = dx #dt = 1.0
    c = dx / dt #lattice sound speed
    Re = 1000.0 #Reynoldz number
    ν = U * ly / Re #kinematic viscousity
    τ = 3.0ν + 0.5 #relaxation time
    ω = 1.0 / τ #relaxation frequency
    println("Re = ", Re, ", τ = ", τ) #instead of "Re = $Re, τ = $τ"
    ε = 1.0e-6 #convergence requirement

    @views u[1, 2:ny-1] .= U
    for iterStep = 1:iterStepMax
        collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        streaming(Q, cx, cy, f, F)
        boundaryCondition(nx, ny, u, v, ρ, U, f)
        macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        @. rel_err_temp = (u - u0)^2 + (v - v0)^2
        temp1 = sum(rel_err_temp)
        @. rel_err_temp = u^2 + v^2
        temp2 = sum(rel_err_temp)
        err = sqrt(temp1 / (temp2 + 1e-30))
        if fld(iterStep, 100) * 100 == iterStep #instead of "floor"
            println(
                "iterStep = $iterStep",
                ", residual = ",
                round(err, digits = 8),
                ", top-left velocity = ",
                round(uv[300, 450], digits = 8),
                ", bottom-left velocity = ",
                round(uv[300, 150], digits = 8),
                ", top-right velocity = ",
                round(uv[900, 450], digits = 8),
                ", bottom-right velocity = ",
                round(uv[900, 150], digits = 8),
            )
        end
        if err <= ε
            break
        end
    end
    clibrary(:colorcet)
    contourf(uv, color = :rainbow, aspect_ratio = 1.0)
end

function collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
    for j = 1:ny, i = 1:nx
        t1 = u[i, j] * u[i, j] + v[i, j] * v[i, j]
        for k = 1:Q
            t2 = u[i, j] * cx[k] + v[i, j] * cy[k]
            feq[i, j, k] = ρ[i, j] * w[k] * (1.0 + 3.0t2 + 4.5 * t2^2 - 1.5t1)
            f[i, j, k] = (1.0 - ω) * f[i, j, k] + ω * feq[i, j, k]
        end
    end
end

function streaming(Q, cx, cy, f, F)
    copy!(F, f)
    for k = 1:Q
        @views f[:, :, k] = ShiftedArrays.circshift(F[:, :, k], (cx[k], cy[k]))
    end
end

function boundaryCondition(nx, ny, u, v, ρ, U, f)
    #left and right
    for j = 2:ny-1
        f[1, j, 1] = f[1, j, 3] + ρ[1, j] * U * 2.0 / 3.0
        f[1, j, 5] = f[1, j, 7] - 0.5 * (f[1, j, 2] - f[1, j, 4]) + ρ[1, j] * U / 6.0
        f[1, j, 8] = f[1, j, 6] + 0.5 * (f[1, j, 2] - f[1, j, 4]) + ρ[1, j] * U / 6.0
        f[nx, j, 3] = f[nx-1, j, 3]
        f[nx, j, 6] = f[nx-1, j, 6]
        f[nx, j, 7] = f[nx-1, j, 7]
    end
    u[1, 2:ny-1] .= U
    v[1, 2:ny-1] .= 0.0

    #bottom and top
    for i = 1:nx
        f[i, 1, 2] = f[i, 1, 4]
        f[i, 1, 5] = f[i, 1, 7]
        f[i, 1, 6] = f[i, 1, 8]
        f[i, ny, 4] = f[i, ny, 2]
        f[i, ny, 8] = f[i, ny, 6]
        f[i, ny, 7] = f[i, ny, 5]
    end
    u[:, 1] .= 0.0
    v[:, 1] .= 0.0
    u[:, ny] .= 0.0
    v[:, ny] .= 0.0
end


function macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
    sum!(ρ, f)
    copy!(u0, u)
    copy!(v0, v)
    for j = 1:ny, i = 1:nx
        uSum = 0.0
        vSum = 0.0
        for k = 1:Q
            uSum += f[i, j, k] * cx[k]
            vSum += f[i, j, k] * cy[k]
        end
        u[i, j] = uSum / ρ[i, j]
        v[i, j] = vSum / ρ[i, j]
    end
    @. uv = sqrt(u^2 + v^2)
end

.mem文件

        - using ShiftedArrays, Plots
        - 
        - function mainFlow(iterStepMax::Int64)
211946304     Q = 9 #D2Q9
        -     nx = 1200 #lattice number in horizontal direction
        -     ny = 600 #lattice number in vertical direction
        -     U = 0.1 #inlet velocity
      160     cx = [1 0 -1 0 1 -1 -1 1 0] #velocity components in horizontal direction
      160     cy = [0 1 0 -1 1 1 -1 -1 0] #velocity components in vertical direction
      160     w = [1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 9.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 1.0 / 36.0 4.0 / 9.0] #weight parameters
  5760080     rel_err_temp = ones(Float64, nx, ny)
  5760080     ρ = ones(Float64, nx, ny) #initial density
  5760080     u = zeros(Float64, nx, ny) #horizontal velocity component
  5760080     v = zeros(Float64, nx, ny) #vertical velocity component
  5760080     u0 = zeros(Float64, nx, ny) #horizontal velocity component before iteration
  5760080     v0 = zeros(Float64, nx, ny) #vertical velocity component befre iteration
  5760080     uv = zeros(Float64, nx, ny) #resultant velocity
 51840080     f = zeros(Float64, nx, ny, Q) #distribution function
 51840080     F = zeros(Float64, nx, ny, Q)
 51840080     feq = zeros(Float64, nx, ny, Q) #equilibrium distribution function
        -     dx = 1.0 #horizontal lattice length
        -     dy = 1.0 #vertical lattice length
        -     lx = dx * nx #horizontal domain length
        -     ly = dy * ny #vertical domain length
        -     dt = dx #dt = 1.0
        -     c = dx / dt #lattice sound speed
        -     Re = 1000.0 #Reynoldz number
        -     ν = U * ly / Re #kinematic viscousity
        -     τ = 3.0ν + 0.5 #relaxation time
        -     ω = 1.0 / τ #relaxation frequency
  4941680     println("Re = ", Re, ", τ = ", τ) #instead of "Re = $Re, τ = $τ"
        -     ε = 1.0e-6 #convergence requirement
        - 
        0     @views u[1, 2:ny-1] .= U
        0     for iterStep = 1:iterStepMax
        0         collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        0         streaming(Q, cx, cy, f, F)
        0         boundaryCondition(nx, ny, u, v, ρ, U, f)
        0         macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        0         @. rel_err_temp = (u - u0)^2 + (v - v0)^2
        0         temp1 = sum(rel_err_temp)
        0         @. rel_err_temp = u^2 + v^2
        0         temp2 = sum(rel_err_temp)
        0         err = sqrt(temp1 / (temp2 + 1e-30))
        0         if fld(iterStep, 100) * 100 == iterStep #instead of "floor"
        0             println(
        -                 "iterStep = $iterStep",
        -                 ", residual = ",
        -                 round(err, digits = 8),
        -                 ", top-left velocity = ",
        -                 round(uv[300, 450], digits = 8),
        -                 ", bottom-left velocity = ",
        -                 round(uv[300, 150], digits = 8),
        -                 ", top-right velocity = ",
        -                 round(uv[900, 450], digits = 8),
        -                 ", bottom-right velocity = ",
        -                 round(uv[900, 150], digits = 8),
        -             )
        -         end
        0         if err <= ε
        0             break
        -         end
        -     end
        0     clibrary(:colorcet)
       64     contourf(uv, color = :rainbow, aspect_ratio = 1.0)
        - end
        - 
        - function collision(nx, ny, Q, u, v, ρ, cx, cy, w, ω, f, feq)
        0     for j = 1:ny, i = 1:nx
        0         t1 = u[i, j] * u[i, j] + v[i, j] * v[i, j]
        0         for k = 1:Q
        0             t2 = u[i, j] * cx[k] + v[i, j] * cy[k]
        0             feq[i, j, k] = ρ[i, j] * w[k] * (1.0 + 3.0t2 + 4.5 * t2^2 - 1.5t1)
        0             f[i, j, k] = (1.0 - ω) * f[i, j, k] + ω * feq[i, j, k]
        -         end
        -     end
        - end
        - 
        - function streaming(Q, cx, cy, f, F)
        0     copy!(F, f)
        0     for k = 1:Q
    43200         @views f[:, :, k] = ShiftedArrays.circshift(F[:, :, k], (cx[k], cy[k]))
        -     end
        - end
        - 
        - function boundaryCondition(nx, ny, u, v, ρ, U, f)
        -     #left and right
        0     for j = 2:ny-1
        0         f[1, j, 1] = f[1, j, 3] + ρ[1, j] * U * 2.0 / 3.0
        0         f[1, j, 5] = f[1, j, 7] - 0.5 * (f[1, j, 2] - f[1, j, 4]) + ρ[1, j] * U / 6.0
        0         f[1, j, 8] = f[1, j, 6] + 0.5 * (f[1, j, 2] - f[1, j, 4]) + ρ[1, j] * U / 6.0
        0         f[nx, j, 3] = f[nx-1, j, 3]
        0         f[nx, j, 6] = f[nx-1, j, 6]
        0         f[nx, j, 7] = f[nx-1, j, 7]
        -     end
        0     u[1, 2:ny-1] .= U
        0     v[1, 2:ny-1] .= 0.0
        - 
        -     #bottom and top
        0     for i = 1:nx
        0         f[i, 1, 2] = f[i, 1, 4]
        0         f[i, 1, 5] = f[i, 1, 7]
        0         f[i, 1, 6] = f[i, 1, 8]
        0         f[i, ny, 4] = f[i, ny, 2]
        0         f[i, ny, 8] = f[i, ny, 6]
        0         f[i, ny, 7] = f[i, ny, 5]
        -     end
        0     u[:, 1] .= 0.0
        0     v[:, 1] .= 0.0
        0     u[:, ny] .= 0.0
     2400     v[:, ny] .= 0.0
        - end
        - 
        - 
        - function macroQuantity(nx, ny, Q, u, v, uv, u0, v0, ρ, cx, cy, f)
        0     sum!(ρ, f)
        0     copy!(u0, u)
        0     copy!(v0, v)
        0     for j = 1:ny, i = 1:nx
        -         uSum = 0.0
        -         vSum = 0.0
        0         for k = 1:Q
        0             uSum += f[i, j, k] * cx[k]
        0             vSum += f[i, j, k] * cy[k]
        -         end
        0         u[i, j] = uSum / ρ[i, j]
        0         v[i, j] = vSum / ρ[i, j]
        -     end
        0     @. uv = sqrt(u^2 + v^2)
        - end
        -

除了 2400 v[:, ny] .= 0.0 有点不一致以外，又改善了一些
性能测试如下

@benchmark mainFlow(50)
BenchmarkTools.Trial:
  memory estimate:  197.99 MiB
  allocs estimate:  3231
  --------------
  minimum time:     4.342 s (3.28% GC)
  median time:      4.362 s (1.65% GC)
  mean time:        4.362 s (1.65% GC)
  maximum time:     4.382 s (0.03% GC)
  --------------
  samples:          2
  evals/sample:     1

@benchmark mainFlow(100)
BenchmarkTools.Trial: 
  memory estimate:  198.04 MiB
  allocs estimate:  4244
  --------------
  minimum time:     8.771 s (0.01% GC)
  median time:      8.771 s (0.01% GC)
  mean time:        8.771 s (0.01% GC)
  maximum time:     8.771 s (0.01% GC)
  --------------
  samples:          1
  evals/sample:     1

Roger · 2019 年12 月 24 日 19:30

index的时候要加inbounds。好好读下performance tips再？看看有没有没有遵守的tips

tyfdzt · 2019 年12 月 25 日 04:51

加了inbounds以后，速度提升了四分之一。

woclass · 2019 年12 月 25 日 15:50

看到一个老包或许能参考一下：

grasingerm/lbxflow: Lattice Boltzmann method implemented in julia

可以尝试把他升级到 1.0

henry2004y · 2019 年12 月 26 日 14:52

这个包我研究过一点，写得是很不错，但是由于是老版本没法运行，而且结构设计比较复杂，不是很容易明白内部的结构。原作者明显是C++出身，文档里有在老版本上如何使用的介绍，但没有代码本身的结构叙述。但我想如果能明白，也会是很大的提高。

看到以上几位的讨论深有感触，自己一个个踩过的坑，历历在目。Julia目前存在很多这样的问题，由于她的新，必然也会有很多的代码不稳定和写法的不统一。比如用@views导致在heap上分配内存的问题，不稍微了解深入一些对于非计算机方向的人来说就有门槛。

我在Github上曾经放过一个整理和修改的以别人写的简单的Matlab为主的LBM仓库https://github.com/henry2004y/LBMD2Q9。

现在刚整理出来一个简单的Julia包：https://github.com/henry2004y/LBM_tutorial 这主要是我学习LBM过程中的一些练习，有些部分比如不可压流体的2D还没空完成，索性一起放出来了，仅供参考。关于性能的部分我这里没有仔细考虑，所以可能还有很多值得提升的地方。但显然若要论性能，LBM的Julia并行实现才是更应该探索的方向。我目前的感觉是Julia并行方式虽然多，但都不是很完善，也很不容易深入下去。

henry2004y · 2019 年12 月 26 日 14:57

一个小的风格建议：Julia的习惯是，如果对输入参数进行了改动，在函数最后应加上！。

dong2221 · 2020 年5 月 1 日 04:23

弱弱地问一句，mem的数据是怎么获得的？手册上的我试了，不行阿。

woclass · 2020 年5 月 1 日 04:59

用命令行运行，加 --track-allocation 参数。
会在同文件夹下生成 .mem 文件