Bus error (core dumped)?

I'd like to ask everyone how to resolve a bus error in Julia.

I am looping over a large Vector of length 120 in which every element takes a lot of memory (1-2 GB). So I wrote the following type:

struct Quasi_Vector <: AbstractVector
    path :: String
    header :: String
    sufflix::String
    L :: Int
end

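It records where the vector lives on disk and how its files are named. I then overload the following functions to read and update the elements: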

Base.getindex(ql::Quasi_Vector, i::Int) = deserialize(joinpath(ql.path, ql.header * string(i) * "." * ql.sufflix))
Base.setindex!(ql::Quasi_Vector, value::TensorMap, i::Int) = serialize(joinpath(ql.path, ql.header * string(i) * "." * ql.sufflix), value)
Base.length(ql::Quasi_Vector) = ql.L
Base.eachindex(ql::Quasi_Vector) = 1:ql.L
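
For anyone who wants to try the idea in isolation, here is a minimal, self-contained sketch of such a disk-backed vector. It is not the code from the post: `DiskVector`, `elementfile` and the field name `suffix` are made up for illustration, the element type is passed as a type parameter, and `size` is defined because the AbstractVector interface relies on it.

```julia
using Serialization  # provides serialize / deserialize

# Hypothetical disk-backed vector: element i lives in "<path>/<header><i>.<suffix>".
struct DiskVector{T} <: AbstractVector{T}
    path::String
    header::String
    suffix::String
    L::Int
end

# Build the file name for element i.
elementfile(v::DiskVector, i::Int) = joinpath(v.path, string(v.header, i, ".", v.suffix))

# The AbstractVector interface derives length, axes and bounds checks from size.
Base.size(v::DiskVector) = (v.L,)

# Reading an element deserializes its file.
function Base.getindex(v::DiskVector{T}, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    return deserialize(elementfile(v, i))::T
end

# Writing an element serializes it to its file.
function Base.setindex!(v::DiskVector{T}, value, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    serialize(elementfile(v, i), convert(T, value))
    return v
end

# Usage: store two small matrices in a temporary directory and read one back.
dir = mktempdir()
v = DiskVector{Matrix{Float64}}(dir, "psi_", "jls", 2)
v[1] = ones(2, 2)
v[2] = zeros(3, 3)
roundtrip = v[1]
```

Every `v[i]` here goes to disk, so anything that iterates or prints the whole vector will read every file.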

But after running for a while, the program hit a bus error. The exact message is:

[1719211] signal (7.2): Bus error
in expression starting at /home/minlo/boson_haldane/n_1div2_todisk/run_dmrg_Dorder_savespace.jl:273
/var/spool/slurm/d/job1867643/slurm_script: line 41: 1719211 Bus error               (core dumped) julia $1 $2

My Julia version is 1.9.4, and the program runs on a cluster.

Multithreaded or single-threaded?

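Single-threaded. The program is a density matrix renormalization group (DMRG) calculation from computational physics. I haven't managed to produce a minimal example yet.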

The number of Julia threads is 1, but I set the number of BLAS threads to 10.
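
A quick way to confirm both settings from inside the script might look like this. It is only a sketch: the value 10 comes from the post, the variable names are made up.

```julia
using LinearAlgebra

# Set the BLAS thread count mentioned in the post.
BLAS.set_num_threads(10)

# Julia's own thread count is fixed at startup (-t flag or JULIA_NUM_THREADS).
julia_threads = Threads.nthreads()

# Ask BLAS how many threads it is actually using.
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads, ", BLAS threads: ", blas_threads)
```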

Single process and single thread? Could anything be writing to path at the same time?

Yes, single thread and single process. This is the procedure I use to update the two Quasi_Vector objects, psi0 and PH.

for sweep in 1:nsweep
    for i in 1:1:N-1
        psi0[i],psi0[i+1] =optimize_two_site(PH[i],PH[i+2],psi0[i],psi0[i+1],"right")
        PH[i+1] = new_PHL(PH[i],psi0[i])
        GC.gc()
    end
    for i in N-1:-1:1
        tic = time()
        psi0[i],psi0[i+1] =optimize_two_site(PH[i],PH[i+2],psi0[i],psi0[i+1],"left")
        PH[i+1] = new_PHR(PH[i+2],psi0[i+1])
        GC.gc()
    end
end

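Judging from the code, each optimize_two_site step should first evaluate the right-hand side of the assignment, which does the reads, and only then the left-hand side, which does the writes.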

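If my understanding is correct, reads and writes should never happen at the same time. The whole iteration is sequential, and the main program is not parallelized.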
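
The read-then-write order can be checked with a small logging wrapper. This is only an illustration: `LoggedVector` and `swap_and_add` are made-up stand-ins for the disk-backed vector and for optimize_two_site.

```julia
# Hypothetical wrapper that records every read and write, in order.
struct LoggedVector <: AbstractVector{Int}
    data::Vector{Int}
    log::Vector{String}
end

Base.size(v::LoggedVector) = size(v.data)

function Base.getindex(v::LoggedVector, i::Int)
    push!(v.log, "read $i")
    return v.data[i]
end

function Base.setindex!(v::LoggedVector, x, i::Int)
    push!(v.log, "write $i")
    v.data[i] = x
    return v
end

# Stand-in for optimize_two_site: takes two values, returns two new ones.
swap_and_add(a, b) = (b + 1, a + 1)

v = LoggedVector([10, 20], String[])
v[1], v[2] = swap_and_add(v[1], v[2])
println(v.log)   # prints ["read 1", "read 2", "write 1", "write 2"]
```

Both reads finish before the first write starts, which matches the reasoning above for a single-threaded run.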

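The elements of psi0 and PH are tensors. Their dimensions grow from sweep to sweep, and the bus error appears partway through the run, by which point the tensors are already much larger than at the start.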

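I checked the RAM available to the program (200 GB) and the hard disk quota (2 TB). Both are sufficient, so this shouldn't be caused by running out of memory.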
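
To watch these numbers while the job runs rather than checking them once, something like the following could be called inside the sweep loop. It is a sketch only: `resource_report` and the file name `PH_1.jls` are hypothetical, and the figures are whatever the operating system reports to Julia.

```julia
using Serialization

# Hypothetical helper: free/total RAM, peak resident memory of this process,
# and the on-disk size of one element file, all in GiB.
function resource_report(file::AbstractString)
    gib = 2.0^30
    return (free_ram_gib  = Sys.free_memory() / gib,
            total_ram_gib = Sys.total_memory() / gib,
            peak_rss_gib  = Sys.maxrss() / gib,
            file_gib      = filesize(file) / gib)
end

# Usage with a small stand-in file in a temporary directory.
tmp = joinpath(mktempdir(), "PH_1.jls")
serialize(tmp, rand(1000))
report = resource_report(tmp)
println(report)
```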

You could look on Stack Overflow to see whether anyone has hit something similar.

c - What is a bus error? Is it different from a segmentation fault? - Stack Overflow

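Searching the English-language forum turns up more posts about bus error.
From skimming them, shared memory seems to be a common cause, and so is reading and writing mmap'd files.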

Yes, what I wrote here amounts to a simple mmap. The second run produced a lot more error output:

[1879238] signal (7.2): Bus error
in expression starting at /home/minlo/boson_haldane/n_1div2_todisk/run_dmrg_Dorder_savespace.jl:273
ht_keyindex2_shorthash! at ./dict.jl:299
getindex at ./essentials.jl:13 [inlined]
iterate at ./array.jl:893 [inlined]
Path at /home/minlo/.julia/packages/RelocatableFolders/URtlI/src/RelocatableFolders.jl:72
- at ./int.jl:86 [inlined]
length at ./range.jl:751 [inlined]
axes at ./range.jl:696 [inlined]
axes at ./generator.jl:52 [inlined]
_similar_shape at ./array.jl:659
#plot!#199 at /home/minlo/.julia/packages/Plots/kLeqV/src/plot.jl:211
plot! at /home/minlo/.julia/packages/Plots/kLeqV/src/plot.jl:208
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
do_call at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1903
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1963
include at ./Base.jl:457
jfptr_include_40628.clone_1 at /home/minlo/.julia/juliaup/julia-1.9.4+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
exec_options at ./client.jl:307
_start at ./client.jl:522
jfptr__start_49509.clone_1 at /home/minlo/.julia/juliaup/julia-1.9.4+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
true_main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/src/jlapi.c:717
main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-9/cli/loader_exe.c:59
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 1454829914 (Pool: 1452236604; Big: 2593310); GC: 97809
/var/spool/slurm/d/job1869797/slurm_script: line 41: 1879238 Bus error               (core dumped) julia $1 $2

I'm very puzzled by this.

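I once fixed a bus error caused by an mmap'd file being smaller than the requested size:
Mmap bus error · Issue #28245 · JuliaLang/julia · GitHub. If you can, try 1.10 first; it may be able to pinpoint the line.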
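
For illustration, a user-side guard against that failure mode might look like the sketch below. It is not the fix made in Julia itself: `safe_mmap` is a hypothetical helper that refuses to map a read-only file that is too small for the requested length.

```julia
using Mmap

# Hypothetical guard: only map a read-only file if it holds at least n values of type T.
function safe_mmap(path::AbstractString, ::Type{T}, n::Integer) where {T}
    needed = n * sizeof(T)
    actual = filesize(path)
    actual >= needed || error("file has $actual bytes, but $needed bytes were requested")
    return open(path, "r") do io
        Mmap.mmap(io, Vector{T}, (n,))
    end
end

# Usage: a file holding 4 Float64 values (32 bytes).
path = joinpath(mktempdir(), "data.bin")
write(path, collect(1.0:4.0))

ok = safe_mmap(path, Float64, 4)      # fits, so the mapping succeeds

too_big = try
    safe_mmap(path, Float64, 8)       # asks for 64 bytes, so the guard throws
    false
catch
    true
end
```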

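My code doesn't actually use mmap; I just hand-wrote a simple memory mapping of my own. Anyway, I can try a different version and see whether that works. Thanks for the suggestion.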