Extract elements from a file into a list in R

Time:12-17

I have a file such as:

>Nscaffold_033778.1_22 [24674 - 24880] some information  
ACCATTAAGAGAGAAAAGAGAGGAGAGAGAGAGAGGAGAGAGAGAGAGAGGagAGGAAGA
AGGAGAGAGA
>NC_0337652.1_23 [26291 - 26443] some other informations boring
MMDOODODODODNJBCIOICICVOCICVCPCCM
>contig_033652.1_24 [25507 - 26529] species with informations 
AJGSIVPDYVPDYVDPYVDPYDVDYVPVYVIYVDPIDVYPDIVYDPIYVDPIDVYPDIVP
PUDVPIYDVPDIVPDIDVPDVPDIVDPVDIVPDIVPDIVDPIDVIDVPDDIVPDDVPDVD
DDGGDDGDDIDIDDFDUDUDTTUDDUCDUDCDCC

I would like to extract only the following information, in list format:

the list:

[[1]]
[1] "Nscaffold_033778.1_22" "24674"        "24880"       

[[2]]
[1] "NC_0337652.1_23" "26291"        "26443"       

[[3]]
[1] "contig_033652.1_24" "25507"         "26529"  

Does anyone have an idea?

  • The first element of each list entry is the part after the ">" symbol, up to the first space.
  • The second element is the first number within the [].
  • The third element is the second number within the [].

CodePudding user response:

We could read the file with readLines, keep the header lines with grep, capture the relevant fields with sub, and split the result with strsplit:

strsplit(sub("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\].*", "\\1,\\2,\\3", 
    grep(">", lines, value = TRUE)), ",")

-output

[[1]]
[1] "Nscaffold_033778.1_22" "24674"                 "24880"                

[[2]]
[1] "NC_0337652.1_23" "26291"           "26443"          

[[3]]
[1] "contig_033652.1_24" "25507"              "26529"      

data

lines <- readLines('file.txt')
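The same steps can also be spelled out one at a time with regexec() and regmatches(), which return the capture groups as a list directly. This is only a minimal sketch, assuming every header looks like ">ID [start - end] ..."; the inline lines vector is a shortened stand-in for the readLines() result, and the names headers, m and res are illustrative.

```r
# Stand-in for readLines('file.txt'); sequences shortened for illustration
lines <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  "ACCATTAAGAGAG",
  ">NC_0337652.1_23 [26291 - 26443] some other information",
  "MMDOODODODODNJ"
)

# Keep only the header lines (those starting with ">")
headers <- grep("^>", lines, value = TRUE)

# regexec() records the full match plus each capture group;
# regmatches() then returns one character vector per header
m <- regmatches(headers,
                regexec("^>(\\S+)\\s+\\[(\\d+)\\D+(\\d+)\\]", headers))

# Drop the first element (the full match) to keep id, start, end
res <- lapply(m, function(x) x[-1])
res
```

A header that does not match the pattern yields an empty character vector here, rather than the unchanged line that sub() would return.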

CodePudding user response:

If vec holds the file contents,

vec <- readLines(...)

then

strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
           vec[grepl("^>", vec)],
           list(x="",y="",z=""))
#                        x     y     z
# 1 Nscaffold_033778.1_22  24674 24880
# 2       NC_0337652.1_23  26291 26443
# 3    contig_033652.1_24  25507 26529

I recognize this is not strictly the format requested. I offer it as an alternative, as it gives ready access to all of the contents. Further, if you intend to convert the columns y and z to integers, you can do that in the same call by replacing the third argument (proto=) with list(x="", y=1L, z=1L). Compare the two calls below:

str(
  strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
             vec[grepl("^>", vec)],
             list(x="",y="",z=""))
)
# 'data.frame': 3 obs. of  3 variables:
#  $ x: chr  "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
#  $ y: chr  "24674" "26291" "25507"
#  $ z: chr  "24880" "26443" "26529"

str(
  strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
             vec[grepl("^>", vec)],
             list(x="",y=1L,z=1L))
)
# 'data.frame': 3 obs. of  3 variables:
#  $ x: chr  "Nscaffold_033778.1_22 " "NC_0337652.1_23 " "contig_033652.1_24 "
#  $ y: int  24674 26291 25507
#  $ z: int  24880 26443 26529
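
If the list format from the question is still wanted, the data frame can be converted afterwards. This is a minimal sketch rather than part of the answer above: the inline vec is a shortened stand-in for the file contents, df and out are illustrative names, and trimws() is used to drop the trailing space that the greedy (.*) leaves on x.

```r
# Stand-in for readLines(...); sequences shortened for illustration
vec <- c(
  ">Nscaffold_033778.1_22 [24674 - 24880] some information",
  "ACCATTAAGAGAG",
  ">contig_033652.1_24 [25507 - 26529] species with information"
)

df <- strcapture("^>(.*) *\\[(\\d+)\\D*(\\d+).*",
                 vec[grepl("^>", vec)],
                 list(x = "", y = "", z = ""))

# Remove the trailing space captured by the greedy first group
df$x <- trimws(df$x)

# Combine the columns row by row into one character vector per header
out <- unname(Map(c, df$x, df$y, df$z))
out
```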