julia计算人类基因组GC含量

兄弟们,我用Julia写了一个计算人类基因组hg38的GC含量的脚本,
让我惊讶的是,它的性能居然和perl是一样的,不应该大大超过perl吗?

function lineGC(seq::String)
    GCnumber=count(x->(x=='G'||x=='C'||x=='g'||x=='c'),seq)
    lineNum=count(x->(x!='N' && x!='n'),seq)
    (GCnumber,lineNum)
end

function calGC(fs)
    GCnumber=zero(Int)
    lineNum=zero(Int)
    open(fs,"r") do IOstream
        for line in eachline(IOstream)
            if startswith(line,">")
                continue
            else
                GC,all=lineGC(line)
                GCnumber+=GC
                lineNum+=all
            end
        end
    end
    round(GCnumber/lineNum;digits=3)
end

println("GC content: ",calGC(ARGS[1]))

请问,为什么我的性能那么差

julia
perl

有没有数据拿给大家伙试一下。

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
解压即可

@code_warntype 看了下类型,不是很稳定。我给改稳定后,结果更慢了。

内存分配计数显示大头在累加的两行,就很奇怪。

GCnumber+=GC
lineNum+=all

火焰图


类型不稳是为什么呢,我lineGC函数返回的只是count计数的结果,应该Int64,GCnumber和lineNum应该也是Int64;

为什么不在repl下试试呢,

@btime your_code

没有看到有@btime宏,我是为了比较julia和perl的速度才用的time

我试了下, 我把

open(fs,"r") do IOstream

去掉了,速度提升了3秒。

function lineGC(seq::String)
    GCnumber=count(x->(x=='G'||x=='C'||x=='g'||x=='c'),seq)
    lineNum=count(x->(x!='N' && x!='n'),seq)
    return GCnumber,lineNum
end

function calGC(fs)
    GCnumber=zero(Int)
    lineNum=zero(Int)
        for line in eachline(fs)
            if !startswith(line,">")
                GC,all=lineGC(line)
                GCnumber+=GC
                lineNum+=all
            end
    end
    round(GCnumber/lineNum;digits=3)
end
julia> @benchmark calGC("hg38.fa")
BenchmarkTools.Trial: 
  memory estimate:  5.75 GiB
  allocs estimate:  128470738
  --------------
  minimum time:     20.377 s (1.95% GC)
  median time:      20.377 s (1.95% GC)
  mean time:        20.377 s (1.95% GC)
  maximum time:     20.377 s (1.95% GC)
  --------------
  samples:          1
  evals/sample:     1

1赞

哈哈,去掉了open后,在我这里时间增加了2秒。而用来比较的perl代码,是整个循环都考虑了的

#!/usr/bin/perl -w
  
use strict;

if(@ARGV < 1){
    die "Usage : perl $0 <genome.fa>\n";
}

my $input = shift @ARGV;

my ($sum,$G_num,$C_num,$N_num)=(0,0,0,0);
my $id;
open IN, "< $input" or die $!;
while(my $line = <IN>){
    chomp $line;
    if($line =~ />([^\s]+)/){
        $id = $1;
    }else{
        $sum += length($line);
        $G_num += ($line =~ tr/Gg/Gg/);
        $C_num += ($line =~ tr/Cc/Cc/);
        $N_num += ($line =~ tr/Nn/Nn/);
    }
}
close IN;

my $GC_rate = ($G_num+$C_num)/($sum-$N_num);
printf "GC content: %.3f \n",$GC_rate;

所以我学习julia是为了提升性能,结果和一门古老的动态语言没差

using BenchmarkTools
@btime your_code
@benchmark your_code

Julia运行的时候好像要预热一下,才能发挥正常的速度,所以才需要Benchmark这样的工具
我不能确定perl是否需要预热
最后
image

1赞

:grin: perl纯解释语言,和python和R一样不需要编译。所以和julia速度一致就很皮

我们还不知道是哪里出了问题,还是把意见保留一下吧 :yum:

lineGC(seq::String) 重写了,又提升了3秒。

function lineGC(seq::String)
    GCnumber = 0
    lineNum = 0
    for char in seq
        lineNum += 1
        if char == 'c' || char == 'g' || char == 'C' || char == 'G'
            GCnumber += 1
        elseif char == 'n' ||  char == 'N'
            lineNum -= 1
        end
    end
    return GCnumber, lineNum
end
~  time julia main1.jl hg38.fa # 你的原始版本
GC content: 0.41
julia main1.jl hg38.fa  22.99s user 0.65s system 100% cpu 23.598 total

~  time julia main.jl hg38.fa # 改了之后
GC content: 0.41
julia main.jl hg38.fa  16.55s user 0.60s system 100% cpu 17.108 total

wc

wo@wos-ThinkPad:/mnt/h$ time wc hg38.fa
  64186394   64186394 3273481150 hg38.fa

real    0m47.166s
user    0m41.813s
sys     0m2.719s

julia

原始代码

julia 原始测试代码
function lineGC(seq::String)
    GCnumber=count(x->(x=='G'||x=='C'||x=='g'||x=='c'),seq)
    lineNum=count(x->(x!='N' && x!='n'),seq)
    (GCnumber,lineNum)
end

function calGC(fs)
    GCnumber=zero(Int)
    lineNum=zero(Int)
    open(fs,"r") do IOstream
        for line in eachline(IOstream)
            if startswith(line,">")
                continue
            else
                GC,all=lineGC(line)
                GCnumber+=GC
                lineNum+=all
            end
        end
    end
    # round(GCnumber/lineNum;digits=3)
    GCnumber, lineNum
end

# println("GC content: ",calGC(ARGS[1]))
function main()
    path = raw"H:\hg38.fa"
    GCnumber, lineNum = calGC(path)
    println("GC=$(GCnumber); total=$(lineNum); frac=", GCnumber/lineNum)
end
julia> include(raw"C:\Users\woclass\Desktop\GitHub\gc\julia\JuliaCN\4351-jl-perl\4351-ori.jl")
main (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime main()
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
  43.173 s (256842214 allocations: 7.67 GiB)

julia> @time main()
GC=1250062479; total=3049315783; frac=0.409948515653618
 44.693519 seconds (256.85 M allocations: 7.668 GiB, 0.79% gc time)
内存计数
        - function lineGC(seq::String)
        -     GCnumber=count(x->(x=='G'||x=='C'||x=='g'||x=='c'),seq)
        -     lineNum=count(x->(x!='N' && x!='n'),seq)
        -     (GCnumber,lineNum)
        - end
        - 
        - function calGC(fs)
        -     GCnumber=zero(Int)
        -     lineNum=zero(Int)
        -     open(fs,"r") do IOstream
       48         for line in eachline(IOstream)
        0             if startswith(line,">")
        -                 continue
        -             else
        0                 GC,all=lineGC(line)
1026971568                 GCnumber+=GC
7206074416                 lineNum+=all
        -             end
        -         end
        -     end
        -     # round(GCnumber/lineNum;digits=3)
        -     GCnumber, lineNum
        - end
        - 
        - # println("GC content: ",calGC(ARGS[1]))
        - function main()
        -     path = raw"H:\hg38.fa"
       64     GCnumber, lineNum = calGC(path)
      496     println("GC=$(GCnumber); total=$(lineNum); frac=", GCnumber/lineNum)
        - end
        - 
        - main()

类型稳定测试

不知道为啥装箱了

julia> @code_warntype calGC("H:\\")
Variables
  #self#::Core.Compiler.Const(calGC, false)
  fs::String
  #5::var"#5#6"
  GCnumber@_4::Core.Box
  lineNum@_5::Core.Box
  GCnumber@_6::Union{}
  lineNum@_7::Union{}

Body::Tuple{Any,Any}
1 ─       (GCnumber@_4 = Core.Box())
│         (lineNum@_5 = Core.Box())
│   %3  = Main.zero(Main.Int)::Core.Compiler.Const(0, false)
│         Core.setfield!(GCnumber@_4, :contents, %3)
│   %5  = Main.zero(Main.Int)::Core.Compiler.Const(0, false)
│         Core.setfield!(lineNum@_5, :contents, %5)
│         (#5 = %new(Main.:(var"#5#6"), GCnumber@_4, lineNum@_5))
│   %8  = #5::var"#5#6"
│         Main.open(%8, fs, "r")
│   %10 = Core.isdefined(GCnumber@_4, :contents)::Bool
└──       goto #3 if not %10
2 ─       goto #4
3 ─       Core.NewvarNode(:(GCnumber@_6))
└──       GCnumber@_6
4 ┄ %15 = Core.getfield(GCnumber@_4, :contents)::Any
│   %16 = Core.isdefined(lineNum@_5, :contents)::Bool
└──       goto #6 if not %16
5 ─       goto #7
6 ─       Core.NewvarNode(:(lineNum@_7))
└──       lineNum@_7
7 ┄ %21 = Core.getfield(lineNum@_5, :contents)::Any
│   %22 = Core.tuple(%15, %21)::Tuple{Any,Any}
└──       return %22

直接使用 eachline 读文件

修改后代码
function lineGC(seq::String)
    GCnumber=count(x->(x=='G'||x=='C'||x=='g'||x=='c'),seq)
    lineNum=count(x->(x!='N' && x!='n'),seq)
    (GCnumber,lineNum)
end

function calGC(fs)
    GCnumber=zero(Int)
    lineNum=zero(Int)

    for line in eachline(fs)
        if startswith(line,">")
            continue
        else
            GC,all=lineGC(line)
            GCnumber+=GC
            lineNum+=all
        end
    end
    
    # round(GCnumber/lineNum;digits=3)
    GCnumber, lineNum
end

# println("GC content: ",calGC(ARGS[1]))
function main()
    path = raw"H:\hg38.fa"
    GCnumber, lineNum = calGC(path)
    println("GC=$(GCnumber); total=$(lineNum); frac=", GCnumber/lineNum)
end

julia> @btime main()
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
  38.196 s (128470761 allocations: 5.75 GiB)

julia> @time main()
GC=1250062479; total=3049315783; frac=0.409948515653618
 46.266453 seconds (128.47 M allocations: 5.755 GiB, 0.70% gc time)
内存计数
        - function lineGC(seq::String)
        -     GCnumber=count(x->(x=='G'||x=='C'||x=='g'||x=='c'),seq)
        -     lineNum=count(x->(x!='N' && x!='n'),seq)
        -     (GCnumber,lineNum)
        - end
        - 
        - function calGC(fs)
        -     GCnumber=zero(Int)
        -     lineNum=zero(Int)
        - 
      800     for line in eachline(fs)
        0         if startswith(line,">")
        -             continue
        -         else
        0             GC,all=lineGC(line)
        0             GCnumber+=GC
6179102752             lineNum+=all
        -         end
        -     end
        -     
        -     # round(GCnumber/lineNum;digits=3)
        0     GCnumber, lineNum
        - end
        - 
        - # println("GC content: ",calGC(ARGS[1]))
        - function main()
        -     path = raw"H:\hg38.fa"
        0     GCnumber, lineNum = calGC(path)
      512     println("GC=$(GCnumber); total=$(lineNum); frac=", GCnumber/lineNum)
        - end
        - 
        - main()

类型稳定测试
julia> @code_warntype calGC("H:\\")
Variables
  #self#::Core.Compiler.Const(calGC, false)
  fs::String
  GCnumber::Int64
  lineNum::Int64
  @_5::Union{Nothing, Tuple{String,Nothing}}
  line::String
  GC::Int64
  @_8::Int64
  all::Int64

Body::Tuple{Int64,Int64}
1 ─       (GCnumber = Main.zero(Main.Int))
│         (lineNum = Main.zero(Main.Int))
│   %3  = Main.eachline(fs)::Base.EachLine{IOStream}
│         (@_5 = Base.iterate(%3))
│   %5  = (@_5 === nothing)::Bool
│   %6  = Base.not_int(%5)::Bool
└──       goto #7 if not %6
2 ┄       Core.NewvarNode(:(GC))
│         Core.NewvarNode(:(@_8))
│         Core.NewvarNode(:(all))
│   %11 = @_5::Tuple{String,Nothing}::Tuple{String,Nothing}
│         (line = Core.getfield(%11, 1))
│   %13 = Core.getfield(%11, 2)::Core.Compiler.Const(nothing, false)
│   %14 = Main.startswith(line, ">")::Bool
└──       goto #4 if not %14
3 ─       goto #5
4 ─ %17 = Main.lineGC(line)::Tuple{Int64,Int64}
│   %18 = Base.indexed_iterate(%17, 1)::Core.Compiler.PartialStruct(Tuple{Int64,Int64}, Any[Int64, Core.Compiler.Const(2, false)])
│         (GC = Core.getfield(%18, 1))
│         (@_8 = Core.getfield(%18, 2))
│   %21 = Base.indexed_iterate(%17, 2, @_8::Core.Compiler.Const(2, false))::Core.Compiler.PartialStruct(Tuple{Int64,Int64}, Any[Int64, Core.Compiler.Const(3, false)])
│         (all = Core.getfield(%21, 1))
│         (GCnumber = GCnumber + GC)
└──       (lineNum = lineNum + all)
5 ┄       (@_5 = Base.iterate(%3, %13))
│   %26 = (@_5 === nothing)::Bool
│   %27 = Base.not_int(%26)::Bool
└──       goto #7 if not %27
6 ─       goto #2
7 ┄ %30 = Core.tuple(GCnumber, lineNum)::Tuple{Int64,Int64}
└──       return %30

改了之后内存分配计数基本下降一半,但时间上的占比依旧很小。
问题出在计算部分。

用数组统计

julia 代码

function countAll!(arr::Vector{Int64}, seq::AbstractString)
    for c in seq
        arr[Int(c)] += 1
    end
end

function countChar(fname)
    arr = zeros(Int64, 128)

    for line in eachline(fname)
        if startswith(line,">")
            continue
        else
            countAll!(arr, line)
        end
    end

    arr
end

function main()
    path = raw"H:\hg38.fa"
    arr = countChar(path)
    G = arr[Int('G')] + arr['g' |> Int]
    C = arr['C'|> Int] + arr['c'|> Int]
    total = G+C + arr['A'|> Int] + arr['a'|> Int] +arr['T'|> Int] + arr['t'|> Int]
    println("GC=$(G+C); total=$(total); frac=", (G+C)/total)
end

main()
julia> @btime main()
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
GC=1250062479; total=3049315783; frac=0.409948515653618
  27.584 s (128470762 allocations: 5.75 GiB)

julia> @time main()
GC=1250062479; total=3049315783; frac=0.409948515653618
 29.422292 seconds (128.47 M allocations: 5.755 GiB, 0.84% gc time)
类型稳定性
julia> @code_warntype countChar("H:\\")
Variables
  #self#::Core.Compiler.Const(countChar, false)
  fname::String
  arr::Array{Int64,1}
  @_4::Union{Nothing, Tuple{String,Nothing}}
  line::String

Body::Array{Int64,1}
1 ─       (arr = Main.zeros(Main.Int64, 128))
│   %2  = Main.eachline(fname)::Base.EachLine{IOStream}
│         (@_4 = Base.iterate(%2))
│   %4  = (@_4 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #7 if not %5
2 ┄ %7  = @_4::Tuple{String,Nothing}::Tuple{String,Nothing}
│         (line = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Core.Compiler.Const(nothing, false)
│   %10 = Main.startswith(line, ">")::Bool
└──       goto #4 if not %10
3 ─       goto #5
4 ─       Main.countAll!(arr, line)
5 ┄       (@_4 = Base.iterate(%2, %9))
│   %15 = (@_4 === nothing)::Bool
│   %16 = Base.not_int(%15)::Bool
└──       goto #7 if not %16
6 ─       goto #2
7 ┄       return arr

java

用了一个 if-else 和一个数组统计的方法。

java 代码

baidu 刚刚上手,写的比较挫

package com.company;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Main {
    static long[] freqs = new long[128];
    static long G = 0;
    static long C = 0;
    static double total = 0.0;

    public static void countLettersIf(String filename) throws IOException{
        try(BufferedReader in = new BufferedReader(new FileReader(filename))){
            String line;
            while((line = in.readLine()) != null){
                for(char ch:line.toCharArray()){
                    if (ch=='>') {
                        break;
                    } else if(ch=='G' || ch=='g') {
                        G++;
                    } else if(ch=='C' || ch=='c'){
                        C++;
                    }
                    if (ch!='N' && ch!='n') {
                        total++;
                    }
                }
            }
        }
    }

    public static long[] countLettersArray(String filename) throws IOException{
        try(BufferedReader in = new BufferedReader(new FileReader(filename))){
            String line;
            while((line = in.readLine()) != null){
                for(char ch:line.toCharArray()){
                    if (ch=='>') {
                        break;
                    }
                    freqs[ch]++;
                }
            }
        }
        return freqs;
    }

    public static void printRes() {
        System.out.println("G=" + G + "; C=" + C + "; GC=" + (G+C));
        System.out.println("total=" + (long) total);
        System.out.println("frac=" + ((G+C) / total));
    }

    public static void main(String[] args) throws IOException{
        long starTime = System.currentTimeMillis();
        countLettersIf("H:\\hg38.fa");
        long endTime = System.currentTimeMillis();
        long Time = endTime - starTime;
        System.out.println("if-else style Time=" + Time/1000.0 + "s");
        printRes();

        System.out.println("----------------");

        starTime = System.currentTimeMillis();
        countLettersArray("H:\\hg38.fa");
        endTime = System.currentTimeMillis();
        Time = endTime - starTime;
        System.out.println("array style Time=" + Time/1000.0 + "s");
        G = freqs['G'] + freqs['g'];
        C = freqs['C'] + freqs['c'];
        total = G+C+freqs['A'] + freqs['a']+freqs['T'] + freqs['t'];
        printRes();
    }
}
if-else style Time=42.075s
G=626335137; C=623727342; GC=1250062479
total=3049315783
frac=0.409948515653618
----------------
array style Time=17.056s
G=626335137; C=623727342; GC=1250062479
total=3049315783
frac=0.409948515653618

1赞

感觉新写的lineGC完全就是个c语言版的函数,julia内置的函数用了影响性能我是没想到的

你的数据有 64186394 行,原本的 count 用了两次,相当于每行过了两遍。自己写的时候只过了一遍,应该是要快些的。

做了一个比较快的实现,但是需要再减去开头>行的一些字符;用string读取会ASCII不是一个比较好的选择。

julia> function countChars(io::IO; eol::AbstractChar='\n')
           a = Vector{UInt8}(undef, 8192)
           nb = 0; arr = zeros(Int, 128)
           while !eof(io)
               nb = readbytes!(io, a)
               @simd for i=1:nb
                   @inbounds  arr[a[i]] += 1
               end
           end
           arr
       end

julia> countChars(f::AbstractString; eol::AbstractChar = '\n') = open(io->countChars(io, eol = eol), f)

julia> @time res = countChars(fs);
  4.266670 seconds (14 allocations: 10.062 KiB)

julia> res['C' |> Int] + res['c' |> Int]
623727797

结果略有偏差,因为无差别统计了>行的字符,可以尝试用其他快速的方法去掉
这个实现参考的是Julia的countlines函数,速度和wc -l相当,有个冗余的参数懒得去掉了……

事实上这个循环的消耗已经有10s了

julia> @time for i in eachline("hg38.fa")
           if startswith(i, '>')
               a[1] += 1
           end
       end
 10.926765 seconds (128.47 M allocations: 5.755 GiB, 4.04% gc time)

看了大神们java的实现以及各种算法的优化。其实日常随手一写在不优化算法的情况下,最初的版本已经是比较快的了。同时也证明了,perl作为一门古老的动态语言,在文本处理上确实有其独到之处

上边讲的去掉多余的>字符也可以用类似的方法,添加一个bool flag,遇到>置true,\n置false,实现起来应该也会比较快

京ICP备17009874号-2