Julia尝试将基因序列多行的fasta文件转为一行

目标:将每个基因的碱基序列合并成一行显示,并且基因ID行不变
fasta文件例子

julia> less("test.fa")
>Gene1
ATTGGGGGGGG
ATTTGG
ATGGG
>Gene2
AGGGGGG
AGGGC
TTGT

第一次尝试,想一个基因序列处理完就输出,但后面有问题,需要修改

io = open("new1.fa","w")
open("test.fa") do file
	for line in eachline(file)
		#println(line)
		if(startswith(line,r"^>"))
			write(io,line*"\n")
		else
			write(io,chomp(line))
		end
	end
end
close(io)

查看结果,发现从第二个基因ID开始出现问题,ID和序列未分开

julia> less("new1.fa")
>aaabbb
ATTGGGGGGGGATTTGGATGGG>cccccddd
AGGGGGGAGGGCTTGT

查看其它语言代码后,改用字典来解决,注意这里有个地方必须添加global定义全局,否则报错,希望大佬能帮忙修改完善下,提前感谢

io = open("new2.fa","w")
seq = Dict()
open("test.fa") do file
	for line in eachline(file)
		if(startswith(line,r"^>"))
			global GeneName = chomp(line)
			seq[GeneName] = ""
		else
			seq[GeneName] *= chomp(line)
		end
	end
end
for i in keys(seq)
	write(io,i*"\n")
	write(io,seq[i]*"\n")
end
close(io)

查看结果,满足目标

julia> less("new2.fa")
>Gene2
AGGGGGGAGGGCTTGT
>Gene1
ATTGGGGGGGGATTTGGATGGG

希望能帮忙修改完善下

(性能建议)

  1. 写函数
  2. seq 的类型可以更精确些

谢谢建议,已修改成函数后仍然有几个问题想请教下,目标文件和结果文件属于什么参数类型呢我这里定义为AbstractString不知道是否恰当?函数的输出结果又该定义为什么类型呢?

function JlFasToOneRow(ObjFile::AbstractString;ResFile = string("new_",ObjFile))
	io = open(ResFile,"w")
	seq = Dict{AbstractString,AbstractString}()
	open(ObjFile) do file
		for line in eachline(file)
			if(startswith(line,r"^>"))
				global GeneName = chomp(line)
				seq[GeneName] = ""
			else
				seq[GeneName] *= chomp(line)
			end
		end
	end
	for i in keys(seq)
		write(io,i*"\n")
		write(io,seq[i]*"\n")
	end
	close(io)
end

定义为AbstractString不知道是否恰当

就目前看来恰当

函数的输出结果又该定义为什么类型呢

返回值正常不会被使用,就没必要管(也可以考虑返回 resfile 之类)

  • 这命名不符合规范
  • startswith(line, r"^>") 没必要用正则表达式
  • global GeneName 这里没必要全局作用域

感谢建议。但是global不加会报错,该怎么修改呢

UndefVarError: GeneName not defined

在函数作用域下加 GeneName=""

谢谢! :+1:

function JlFasToOneRow(ObjFile::AbstractString;ResFile = string("new_",ObjFile))
	io = open(ResFile,"w")
	seq = Dict{AbstractString,AbstractString}()
	GeneName = ""
	open(ObjFile) do file
		for line in eachline(file)
			if(startswith(line,">"))
				GeneName = chomp(line)
				seq[GeneName] = ""
			else
				seq[GeneName] *= chomp(line)
			end
		end
	end
	for i in keys(seq)
		write(io,i*"\n")
		write(io,seq[i]*"\n")
	end
	close(io)
end

抱歉可能是我没有理解,先将数据转存到Dict中是有什么目的吗?

function JlFasToOneRow(ObjFile::AbstractString;ResFile = string("new_",ObjFile))
    io = open(ResFile,"w")
    open(ObjFile) do file
        for line in eachline(ObjFile)
            if (line[1]=='>')
                write(io, "\n$(line)\n")
            else
                write(io, line)
            end
        end
    end
    close(io)
end

似乎这样就能满足需求.
另外, 构造默认ResFile的方法对于输入是路径而不是文件名的情况下不够健壮, 例如输入./test.fa.
比较好的做法是实现一个JlFasToOneRow(ObjFile::AbstractString, ResFile::AbstractString), 再重载一个:

function JlFasToOneRow(ObjFile::AbstractString)
    path_split = splitpath(ObjFile)
    path_split[end] = "new_$(path_split[end])"
    new_path = joinpath(path_split)
    JlFasToOneRow(ObjFile, new_path)
end

if (line[1]==‘>’)
write(io, “\n$(line)\n”)

其实我最开始就这样做过,就是这样会在最开始多一行空行,还有就是方便我后面继续使用

julia> less("new_test.fa")

>Gene1
ATTGGGGGGGGATTTGGATGGG
>Gene2
AGGGGGGAGGGCTTGT
>Gene3
AAAGGGGGGCCCTTTGGG

感谢建议,确实对不在同一路径下的文件名就不适用了。
这段代码又学到了(纳为己用,谢谢! :grin:
另外,这个 joinpath(path_split)应该得修改为:

new_path = joinpath(path_split…)

joinpath应该是版本的原因,我用的1.7.3上已经有joinpath(parts::Vector{AbstractString}) -> String的实现。

好吧,尴尬了, :sweat_smile: