I have a file such as:
>Nscaffold_033778.1_22 [24674 - 24880] some information
ACCATTAAGAGAGAAAAGAGAGGAGAGAGAGAGAGGAGAGAGAGAGAGAGGagAGGAAGA
AGGAGAGAGA
>NC_0337652.1_23 [26291 - 26443] some other informations boring
MMDOODODODODNJBCIOICICVOCICVCPCCM
>contig_033652.1_24 [25507 - 26529] species with informations
AJGSIVPDYVPDYVDPYVDPYDVDYVPVYVIYVDPIDVYPDIVYDPIYVDPIDVYPDIVP
PUDVPIYDVPDIVPDIDVPDVPDIVDPVDIVPDIVPDIVDPIDVIDVPDDIVPDDVPDVD
DDGGDDGDDIDIDDFDUDUDTTUDDUCDUDCDCC
I would like to extract only the following information, as a list:
[[1]]
[1] "Nscaffold_033778.1_22" "24674" "24880"
[[2]]
[1] "NC_0337652.1_23" "26291" "26443"
[[3]]
[1] "contig_033652.1_24" "25507" "26529"
Does anyone have an idea?
- The first element of each list entry is the part after the ">" symbol.
- The second element is the first number within the [].
- The third element is the second number within the [].
CodePudding user response:
We could read the file with readLines, grep the header lines, extract the relevant info and split:
strsplit(sub("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\].*", "\\1,\\2,\\3",
grep(">", lines, value = TRUE)), ",")
-output
[[1]]
[1] "Nscaffold_033778.1_22" "24674" "24880"
[[2]]
[1] "NC_0337652.1_23" "26291" "26443"
[[3]]
[1] "contig_033652.1_24" "25507" "26529"
data
lines <- readLines('file.txt')
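If you want to try this without a file on disk, here is a minimal sketch with the header lines typed in as a character vector (sequence lines shortened). It uses regexec/regmatches instead of sub + strsplit, which is just an alternative I am swapping in, not the approach above; the names headers, m and parts are made up for illustration.

```r
# Sample lines copied from the question (sequence lines shortened)
lines <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  "ACCATTAAGAGAGAAAAGAGAGG",
  ">NC_0337652.1_23 [26291 - 26443] some other informations boring",
  "MMDOODODODODNJBCIOICICVOCICVCPCCM",
  ">contig_033652.1_24 [25507 - 26529] species with informations",
  "AJGSIVPDYVPDYVDPYVDP"
)

# Keep only the header lines (those starting with ">")
headers <- grep("^>", lines, value = TRUE)

# regexec() gives the full match followed by one entry per capture group
m <- regmatches(headers, regexec("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\]", headers))

# Drop the full match so each element holds only id, start, end
parts <- lapply(m, `[`, -1)
parts
```

This avoids pasting the groups together with "," and splitting again, so it should also cope with an ID that happens to contain a comma.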
CodePudding user response:
If vec holds the file contents,
vec <- readLines(...)
then
strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
vec[grepl("^>", vec)],
list(x="",y="",z=""))
# x y z
# 1 Nscaffold_033778.1_22 24674 24880
# 2 NC_0337652.1_23 26291 26443
# 3 contig_033652.1_24 25507 26529
I recognize this is not strictly the format requested. I offer it as an alternative, as it gives ready access to all of the contents. Further, if you intend to integer-ize the columns y and z, you can do that in the same call by replacing the third argument (proto=) with list(x="", y=1L, z=1L), as in
str(
strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
vec[grepl("^>", vec)],
list(x="",y="",z=""))
)
# 'data.frame': 3 obs. of 3 variables:
# $ x: chr "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
# $ y: chr "24674" "26291" "25507"
# $ z: chr "24880" "26443" "26529"
str(
strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
vec[grepl("^>", vec)],
list(x="",y=1L,z=1L))
)
# 'data.frame': 3 obs. of 3 variables:
# $ x: chr "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
# $ y: int 24674 26291 25507
# $ z: int 24880 26443 26529
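If you do need the list format from the question, the data frame can be turned into one afterwards. A rough sketch, assuming the three header lines from the question; df and res are names made up for illustration, and trimws() is there because the x column keeps a trailing space with this pattern (visible in the str() output).

```r
# Header lines copied from the question
vec <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  ">NC_0337652.1_23 [26291 - 26443] some other informations boring",
  ">contig_033652.1_24 [25507 - 26529] species with informations"
)

# Same pattern as above, all columns kept as character
df <- strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
                 vec[grepl("^>", vec)],
                 list(x = "", y = "", z = ""))

# Remove any whitespace the greedy (.*) left around the ID
df$x <- trimws(df$x)

# Build one character vector per row; unname() drops the names Map() adds
res <- unname(Map(c, df$x, df$y, df$z))
res
```

Keeping y and z as character here is deliberate: combining an integer column with the ID via c() would coerce everything to character anyway.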