Using split() in R to separate data into groups according to names

Consider the following object x, which prints as a vector in R:

> x
    A
    A
    A
    B
    B
    C

I'd like to use split to split x into 3 groups A, B, and C where A has 3 elements, B has 2, and C has 1.

What should the grouping factor argument, f, be in split()?

The above is a trivial example. My structure is much larger.

My real example consists of FASTA headers where multiple DNA sequences correspond to the same species and I need to split according to species. However, the species name occurs in the header like this:

">COLFG678-14|MZ630002|Agabus|adpressus|AEC6988|COI-5P"

Here the species is Agabus adpressus.

I am unsure of the most appropriate output at this stage, but it could look like this:

$`Agabus adpressus`
Seq1
Seq2
Seq3

CodePudding user response：

I imagine that your real data is not uniform, in that the strings for the same species are not all exactly the same. In that case, you need to pull the species out of each string and split on that:

vals <- c(">COLFG678-14|MZ630002|Agabus|adpressus|AEC6988|COI-5P",
                 ">CZLFG631-11|MZ730009|Agabus|adpressus|BSF8945|AOL-5N",
                 ">XOLGG558-12|MK630011|Agabus|adpressus|JLD6018|CVI-1P",
                 ">YPLFG578-81|JF830122|Agabus|ajax|XCV0091|CMM-1N",
                 ">CLVFG679-13|KA301202|Agabus|ajax|FFP1111|AND-5Z")


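# Skip the first two |-separated fields, then capture the genus (field 3) and species (field 4)
split(vals, sub("^[^|]*\\|[^|]*\\|([^|]+)\\|([^|]+)\\|.*$", "\\1-\\2", vals))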
#> $`Agabus-adpressus`
#> [1] ">COLFG678-14|MZ630002|Agabus|adpressus|AEC6988|COI-5P"
#> [2] ">CZLFG631-11|MZ730009|Agabus|adpressus|BSF8945|AOL-5N"
#> [3] ">XOLGG558-12|MK630011|Agabus|adpressus|JLD6018|CVI-1P"
#> 
#> $`Agabus-ajax`
#> [1] ">YPLFG578-81|JF830122|Agabus|ajax|XCV0091|CMM-1N"
#> [2] ">CLVFG679-13|KA301202|Agabus|ajax|FFP1111|AND-5Z"
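
For the toy example in the question, the vector can serve as its own grouping factor, so f is simply x. The same idea carries over to the headers without a regular expression. The following is only a sketch: it assumes that genus and species are always the 3rd and 4th |-separated fields, and the names toy_groups, headers, fields, species, and by_species are made up for illustration.

```r
# Toy case from the question: the vector is its own grouping factor
x <- c("A", "A", "A", "B", "B", "C")
toy_groups <- split(x, x)
lengths(toy_groups)  # A has 3 elements, B has 2, C has 1

# Header case: break each header on "|" and paste fields 3 and 4 together
# (assumes genus and species are always the 3rd and 4th fields)
headers <- c(">COLFG678-14|MZ630002|Agabus|adpressus|AEC6988|COI-5P",
             ">YPLFG578-81|JF830122|Agabus|ajax|XCV0091|CMM-1N",
             ">XOLGG558-12|MK630011|Agabus|adpressus|JLD6018|CVI-1P")
fields  <- strsplit(headers, "|", fixed = TRUE)
species <- vapply(fields, function(p) paste(p[3], p[4]), character(1))

# One list element per species, named "Genus species"
by_species <- split(headers, species)
names(by_species)
```

Using a space as the separator gives list names such as `Agabus adpressus`, which matches the output format sketched in the question.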

CodePudding user response：

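library(purrr)    # map()
library(stringr)  # str_c()
read.table(text = vals, sep = '|') |>
   split(~paste(V3, V4)) |>                  # formula interface needs R >= 4.1
   map(~do.call(str_c, c(.x, sep = '|')))    # re-join the fields of each row with "|"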

$`Agabus adpressus`
[1] ">COLFG678-14|MZ630002|Agabus|adpressus|AEC6988|COI-5P"
[2] ">CZLFG631-11|MZ730009|Agabus|adpressus|BSF8945|AOL-5N"
[3] ">XOLGG558-12|MK630011|Agabus|adpressus|JLD6018|CVI-1P"

$`Agabus ajax`
[1] ">YPLFG578-81|JF830122|Agabus|ajax|XCV0091|CMM-1N"
[2] ">CLVFG679-13|KA301202|Agabus|ajax|FFP1111|AND-5Z"

Although you could use split(), I would recommend dplyr's group_by(), which works similarly and is easier to use.
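
A minimal sketch of what the group_by() route might look like, assuming the dplyr package is installed and that genus and species are the 3rd and 4th |-separated fields. The data frame df and its column names (header, species, n_seqs) are invented for this illustration.

```r
library(dplyr)

headers <- c(">COLFG678-14|MZ630002|Agabus|adpressus|AEC6988|COI-5P",
             ">YPLFG578-81|JF830122|Agabus|ajax|XCV0091|CMM-1N",
             ">XOLGG558-12|MK630011|Agabus|adpressus|JLD6018|CVI-1P")

# Build a data frame with one row per header and a derived species column
# (assumes fields 3 and 4 hold genus and species)
parts <- strsplit(headers, "|", fixed = TRUE)
df <- data.frame(
  header  = headers,
  species = vapply(parts, function(p) paste(p[3], p[4]), character(1))
)

# Group by species and count the sequences in each group
counts <- df |>
  group_by(species) |>
  summarise(n_seqs = n(), .groups = "drop")
counts
```

Unlike split(), this keeps everything in one table, so further per-species summaries can be added to the same summarise() call.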