目标:将每个基因的碱基序列合并成一行显示,并且基因ID行不变
fasta文件例子
julia> less("test.fa")
>Gene1
ATTGGGGGGGG
ATTTGG
ATGGG
>Gene2
AGGGGGG
AGGGC
TTGT
第一次尝试,想一个基因序列处理完就输出,但后面有问题,需要修改
io = open("new1.fa","w")
open("test.fa") do file
for line in eachline(file)
#println(line)
if(startswith(line,r"^>"))
write(io,line*"\n")
else
write(io,chomp(line))
end
end
end
close(io)
查看结果,发现从第二个基因ID开始出现问题,ID和序列未分开
julia> less("new1.fa")
>aaabbb
ATTGGGGGGGGATTTGGATGGG>cccccddd
AGGGGGGAGGGCTTGT
查看其它语言代码后,改用字典来解决,注意这里有个地方必须添加global定义全局,否则报错,希望大佬能帮忙修改完善下,提前感谢
io = open("new2.fa","w")
seq = Dict()
open("test.fa") do file
for line in eachline(file)
if(startswith(line,r"^>"))
global GeneName = chomp(line)
seq[GeneName] = ""
else
seq[GeneName] *= chomp(line)
end
end
end
for i in keys(seq)
write(io,i*"\n")
write(io,seq[i]*"\n")
end
close(io)
查看结果,满足目标
julia> less("new2.fa")
>Gene2
AGGGGGGAGGGCTTGT
>Gene1
ATTGGGGGGGGATTTGGATGGG
谢谢建议,已修改成函数后仍然有几个问题想请教下,目标文件和结果文件属于什么参数类型呢我这里定义为AbstractString不知道是否恰当?函数的输出结果又该定义为什么类型呢?
function JlFasToOneRow(ObjFile::AbstractString;ResFile = string("new_",ObjFile))
io = open(ResFile,"w")
seq = Dict{AbstractString,AbstractString}()
open(ObjFile) do file
for line in eachline(file)
if(startswith(line,r"^>"))
global GeneName = chomp(line)
seq[GeneName] = ""
else
seq[GeneName] *= chomp(line)
end
end
end
for i in keys(seq)
write(io,i*"\n")
write(io,seq[i]*"\n")
end
close(io)
end
定义为AbstractString不知道是否恰当
就目前看来恰当
函数的输出结果又该定义为什么类型呢
返回值正常不会被使用,就没必要管(也可以考虑返回 resfile
之类)
- 这命名不符合规范
startswith(line, r"^>")
没必要用正则表达式
global GeneName
这里没必要全局作用域
感谢建议。但是global不加会报错,该怎么修改呢
UndefVarError: GeneName not defined
谢谢!
function JlFasToOneRow(ObjFile::AbstractString;ResFile = string("new_",ObjFile))
io = open(ResFile,"w")
seq = Dict{AbstractString,AbstractString}()
GeneName = ""
open(ObjFile) do file
for line in eachline(file)
if(startswith(line,">"))
GeneName = chomp(line)
seq[GeneName] = ""
else
seq[GeneName] *= chomp(line)
end
end
end
for i in keys(seq)
write(io,i*"\n")
write(io,seq[i]*"\n")
end
close(io)
end
抱歉可能是我没有理解,先将数据转存到Dict
中是有什么目的吗?
function JlFasToOneRow(ObjFile::AbstractString;ResFile = string("new_",ObjFile))
io = open(ResFile,"w")
open(ObjFile) do file
for line in eachline(ObjFile)
if (line[1]=='>')
write(io, "\n$(line)\n")
else
write(io, line)
end
end
end
close(io)
end
似乎这样就能满足需求.
另外, 构造默认ResFile的方法对于输入是路径而不是文件名的情况下不够健壮, 例如输入./test.fa
.
比较好的做法是实现一个JlFasToOneRow(ObjFile::AbstractString, ResFile::AbstractString)
, 再重载一个:
function JlFasToOneRow(ObjFile::AbstractString)
path_split = splitpath(ObjFile)
path_split[end] = "new_$(path_split[end])"
new_path = joinpath(path_split)
JlFasToOneRow(ObjFile, new_path)
end
if (line[1]==‘>’)
write(io, “\n$(line)\n”)
其实我最开始就这样做过,就是这样会在最开始多一行空行,还有就是方便我后面继续使用
julia> less("new_test.fa")
>Gene1
ATTGGGGGGGGATTTGGATGGG
>Gene2
AGGGGGGAGGGCTTGT
>Gene3
AAAGGGGGGCCCTTTGGG
感谢建议,确实对不在同一路径下的文件名就不适用了。
这段代码又学到了(纳为己用,谢谢! )
另外,这个 joinpath(path_split)应该得修改为:
new_path = joinpath(path_split…)
joinpath
应该是版本的原因,我用的1.7.3
上已经有joinpath(parts::Vector{AbstractString}) -> String
的实现。