Is there a fast way to read a text file in Julia?

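AquaIndigo · May 20, 2020, 03:32

I need to read two columns of numbers from a text file (columns 4 and 5 here). The file is tab-delimited, may contain blank lines, and its first line may be a header. Reading is slow in practice: 4.7 million lines take 4 to 5 s, while C/C++ reads the same file in about 0.6 s. Is there a good way to speed up reading the file?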

function read_file!(file_name::String, file_len::Int64,
     start_at::Array{Int64,1}, end_at::Array{Int64,1})
    f = open(file_name, "r")
    for i = 1:file_len
        tmp = readline(f)
        if tmp != ""
            # split allocates a new array and substrings for every line
            str = split(tmp, "\t")[4:5]
            # tryparse returns `nothing` for non-numeric text (e.g. a header row);
            # store -1 instead, because an Int64 array cannot hold `nothing`
            start_at[i] = something(tryparse(Int, str[1]), -1)
            end_at[i] = something(tryparse(Int, str[2]), -1)
        end
    end
    close(f)
end

nesteiner · May 20, 2020, 04:22

Could you post the C++ code? I'd like to see how the two compare.
By the way, to measure the time, use
using BenchmarkTools
@btime code

AquaIndigo · May 20, 2020, 05:25

Hi, I seem to have misremembered: the C++ time is about 1 s. The code is as follows:

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>
using namespace std;

struct coord {
    int x_, y_, line_num_;

    coord(int _line_num, int _x = 0, int _y = 0) : x_(_x), y_(_y), line_num_(_line_num) {}

    friend bool operator<(const coord &lhs, const coord &rhs) {
        return (lhs.x_ == rhs.x_) ? (lhs.y_ < rhs.y_) : (lhs.x_ < rhs.x_);
    }
};

void input_coord(string_view file_name, int col_1, int col_2, vector<coord> &_coord_container) {
    string buffer;
    ifstream fin(string{file_name});
    int n = -1;
    if (!fin.is_open()) {
        cout << "Open failed!" << endl;
        exit(1);
    }
    // Loop on getline alone; also testing eof() drops a final line that has no trailing newline.
    while (getline(fin, buffer)) {
        if (buffer.empty()) continue;
        // View into buffer, so the substr results below do not point at a destroyed temporary.
        string_view line(buffer);
        int counter(-1);
        size_t pos1(0);

        // Advance pos1 to the tab that precedes column col_1 (0-based).
        while (++counter < col_1)
            pos1 = line.find('\t', pos1 + 1);
        auto pos2(line.find('\t', pos1 + 1));
        string_view svr1(line.substr(pos1 + 1, pos2 - pos1 - 1));
        pos1 = pos2;
        while (++counter < col_2)
            pos1 = line.find('\t', pos1 + 1);
        pos2 = line.find('\t', pos1 + 1);
        string_view svr2(line.substr(pos1 + 1, pos2 - pos1 - 1));
        // Rows whose first selected field is not numeric (e.g. a header) are stored as (0, 0).
        if (!svr1.empty() && svr1[0] >= '0' && svr1[0] <= '9')
            _coord_container.push_back(coord{++n, stoi(string{svr1}), stoi(string{svr2})});
        else
            _coord_container.push_back(coord{++n, 0, 0});
    }
    sort(_coord_container.begin(), _coord_container.end());
}

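Writing this with C file streams is somewhat faster (0.8 s). The approach is similar, but it does not use the atoi library function; I wrote the string-to-integer conversion by hand.
The @btime result is as follows: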

@btime read_file!(file_name_A, len_a, a_start, a_end)
  3.988 s (65975349 allocations: 3.52 GiB)

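AquaIndigo · May 20, 2020, 05:36

I also tried the findall function to locate \t, but it was much slower than split. In C++, splitting a string would need regular expressions, which is very slow, so I searched for \t manually instead. I have not tried findfirst or similar functions.
Also, the data looks roughly as follows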

Sukanka · May 20, 2020, 05:40

Try this: GitHub - BioJulia/GFF3.jl

AquaIndigo · May 20, 2020, 05:55

Thanks, I'll take a look, but the requirement is actually to handle txt files.

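Sukanka · May 20, 2020, 07:14

Could you post part of the text? I think the slowness may come from open; try readline.
You can also look at the thread "Julia速度极慢" ("Julia is extremely slow"), post #32, by Sukanka. Its topic is very close to yours.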

AquaIndigo · May 20, 2020, 07:19

I tried eachline; the performance is about the same.

chr1	altscan	start_codon	895169	895171	2520.15	+	0	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	895169	895226	2520.15	+	0	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	896015	896180	2520.15	+	2	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	896673	896965	2520.15	+	1	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	897009	897130	2520.15	+	2	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	897206	897427	2520.15	+	0	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	897666	897851	2520.15	+	0	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	898084	898297	2520.15	+	0	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	898489	898633	2520.15	+	2	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	898717	898884	2520.15	+	1	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	899300	899388	2520.15	+	1	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	899487	899560	2520.15	+	2	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	899729	899910	2520.15	+	0	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	CDS	900343	900568	2520.15	+	1	gene_id "chr1.8"; transcript_id "chr1.8.2";
chr1	altscan	stop_codon	900569	900571	2520.15	+	0	gene_id "chr1.8"; transcript_id "chr1.8.2";

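AquaIndigo · May 26, 2020, 15:26

I found that using findnext to search up to the target position cuts memory use and time considerably. Reading the file once now takes about 1.6 s, roughly twice the time of the C version.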
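
The findnext approach was not posted in the thread, so the following is only a minimal sketch of what it might look like, using Base Julia alone. The helper names parse_column and read_columns are hypothetical, and unlike read_file! above this sketch skips header and malformed rows instead of storing -1. The idea is to walk from tab to tab with findnext and parse a SubString view of the field, rather than letting split allocate an array of substrings for every line.

```julia
# Hypothetical helper (not from the thread): parse tab-separated field number
# `col` (1-based) of `line` as an Int, without calling split.
# Returns `nothing` if the line has too few fields or the field is not an integer.
function parse_column(line::AbstractString, col::Integer)
    start = firstindex(line)
    # Skip col-1 tabs to reach the first character of the wanted field.
    for _ in 1:(col - 1)
        tab = findnext('\t', line, start)
        tab === nothing && return nothing
        start = nextind(line, tab)
    end
    # The field ends just before the next tab, or at the end of the line.
    next_tab = findnext('\t', line, start)
    stop = next_tab === nothing ? lastindex(line) : prevind(line, next_tab)
    return tryparse(Int, SubString(line, start, stop))
end

# Hypothetical reader: collect columns col_1 and col_2 of every data row.
function read_columns(path::AbstractString; col_1::Integer = 4, col_2::Integer = 5)
    start_at = Int[]
    end_at = Int[]
    for line in eachline(path)
        isempty(line) && continue
        a = parse_column(line, col_1)
        b = parse_column(line, col_2)
        # Assumption: header or malformed rows are skipped, not stored as -1.
        (a === nothing || b === nothing) && continue
        push!(start_at, a)
        push!(end_at, b)
    end
    return start_at, end_at
end
```

Calling read_columns(file_name) returns the two integer vectors. The sketch scans each line twice (once per column) to stay short; a single pass that remembers the position of the fourth tab would avoid the repeated search.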

johnnychen94 · May 27, 2020, 03:54

Isn't this just a typical CSV file…

I haven't tested it, but CSV.File(file_name; header=1, ignoreemptylines=true, delim='\t') followed by loading the result into DataFrames for processing should do the job…

Reference: Home · CSV.jl
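
The CSV.File suggestion above is untested in the thread, so here is a small sketch of how it might be applied to read only columns 4 and 5. It assumes the third-party CSV.jl package is installed, and the sample rows are made up to resemble the data posted earlier. Because the sample has no header row, it passes header=false instead of the header=1 from the post. The keyword for skipping empty lines is left out on purpose: the post spells it ignoreemptylines, while newer CSV.jl releases appear to call it ignoreemptyrows, so check the documentation for your installed version.

```julia
using CSV  # third-party package; assumed installed via `import Pkg; Pkg.add("CSV")`

# Write a tiny tab-separated sample (made-up rows shaped like the thread's data).
path, io = mktemp()
write(io, "chr1\taltscan\tCDS\t895169\t895226\t2520.15\t+\t0\tattr\n",
          "chr1\taltscan\tCDS\t896015\t896180\t2520.15\t+\t2\tattr\n")
close(io)

# header=false: no header row, so columns get the default names Column1, Column2, ...
# select=[4, 5]: keep only the two columns of interest.
f = CSV.File(path; delim='\t', header=false, select=[4, 5])

start_at = collect(f.Column4)  # values of column 4
end_at = collect(f.Column5)    # values of column 5
```

If the real file starts with a header line, header=1 as in the post would be the natural choice, and the columns would then be accessed by their header names instead of Column4 and Column5.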