How can I tell whether strings are made from the same combination of letters in R?

Time:06-03

I'm pretty new to R, so this question may be simple.

I have a character vector with possible combinations of letters. For example:

[1] "YMC" "YCM" "MYC" "CMY" "CYM" "MCY" "MEH" "HEM" "EMH" "MHE" "EHM" "HME"
[13] "CFF" "FCF" "FFC" "AYY" "YFS" "YYA" "SFY" "YSF" "FSY" "SYF" "YAY" "FYS"
[25] "HYP" "HPY" "WNP" "PWN" "PHY" "PNW" "YHP" "PYH" "WPN" "NPW" "YPH" "NWP"
[37] "BHF" "FHB" "BFH" "HBF" "FBH" "HFB" "BQR" "QRB" "BRQ" "RBQ" "QBR" "RQB"
[49] "BRK" "KRB" "RBK" "BKR" "RKB" "KBR" "WDP" "DPW" "DWP" "WPD" "PDW" "PWD"

I want to know which strings share the same letters (that is, are made from the same letters in a different order).

As you can see, the first 6 strings all come from "C" "Y" "M" and the next 6 from "M" "E" "H".

Similarly, "GPWG" "GWGP" "GPGW" "PWGG" "GGPW" "PGWG" all come from "G" "G" "W" "P".

What R code can I use to answer this question in an automated way?

Many thanks for your help

CodePudding user response:

Supposing your character vector is called vec, you can do:

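# Sort the letters within each string to get a canonical key, e.g. "YMC" -> "CMY"
ordered <- sapply(strsplit(vec, ''), function(x) paste(sort(x), collapse = ''))

# Strings with the same key get the same group number
df <- data.frame(string = vec,
                 letters = ordered,
                 group = match(ordered, unique(ordered)))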

This gives you a data frame with a column for the original strings, a column with each string's letters in alphabetical order, and a grouping variable that identifies which strings are made from the same combination of letters:

df

#>    string letters group
#> 1     YMC     CMY     1
#> 2     YCM     CMY     1
#> 3     MYC     CMY     1
#> 4     CMY     CMY     1
#> 5     CYM     CMY     1
#> 6     MCY     CMY     1
#> 7     MEH     EHM     2
#> 8     HEM     EHM     2
#> 9     EMH     EHM     2
#> 10    MHE     EHM     2
#> 11    EHM     EHM     2
#> 12    HME     EHM     2
#> 13    CFF     CFF     3
#> 14    FCF     CFF     3
#> 15    FFC     CFF     3
#> 16    AYY     AYY     4
#> 17    YFS     FSY     5
#> 18    YYA     AYY     4
#> 19    SFY     FSY     5
#> 20    YSF     FSY     5
#> 21    FSY     FSY     5
#> 22    SYF     FSY     5
#> 23    YAY     AYY     4
#> 24    FYS     FSY     5
#> 25    HYP     HPY     6
#> 26    HPY     HPY     6
#> 27    WNP     NPW     7
#> 28    PWN     NPW     7
#> 29    PHY     HPY     6
#> 30    PNW     NPW     7
#> 31    YHP     HPY     6
#> 32    PYH     HPY     6
#> 33    WPN     NPW     7
#> 34    NPW     NPW     7
#> 35    YPH     HPY     6
#> 36    NWP     NPW     7
#> 37    BHF     BFH     8
#> 38    FHB     BFH     8
#> 39    BFH     BFH     8
#> 40    HBF     BFH     8
#> 41    FBH     BFH     8
#> 42    HFB     BFH     8
#> 43    BQR     BQR     9
#> 44    QRB     BQR     9
#> 45    BRQ     BQR     9
#> 46    RBQ     BQR     9
#> 47    QBR     BQR     9
#> 48    RQB     BQR     9
#> 49    BRK     BKR    10
#> 50    KRB     BKR    10
#> 51    RBK     BKR    10
#> 52    BKR     BKR    10
#> 53    RKB     BKR    10
#> 54    KBR     BKR    10
#> 55    WDP     DPW    11
#> 56    DPW     DPW    11
#> 57    DWP     DPW    11
#> 58    WPD     DPW    11
#> 59    PDW     DPW    11
#> 60    PWD     DPW    11
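
If you would rather see the members of each group listed together, one option is base R's split(), using the sorted-letter key as the grouping variable. This is only a minimal sketch; it assumes a shortened sample of the question's data so that it runs on its own, and the name groups is just illustrative:

```r
# Shortened sample of the question's data (assumption: a subset, for brevity)
vec <- c("YMC", "YCM", "MEH", "HEM", "CFF", "FCF", "AYY", "YYA")

# Key for each string: its letters sorted alphabetically
ordered <- sapply(strsplit(vec, ""), function(x) paste(sort(x), collapse = ""))

# One list element per letter combination, named by the sorted key
groups <- split(vec, ordered)
groups
```

Each element of groups is named after its sorted key, so groups[["CMY"]] returns the strings built from C, M and Y.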

Data from question in reproducible format

vec <- c("YMC", "YCM", "MYC", "CMY", "CYM", "MCY", "MEH", "HEM", "EMH", 
         "MHE", "EHM", "HME", "CFF", "FCF", "FFC", "AYY", "YFS", "YYA", 
         "SFY", "YSF", "FSY", "SYF", "YAY", "FYS", "HYP", "HPY", "WNP", 
         "PWN", "PHY", "PNW", "YHP", "PYH", "WPN", "NPW", "YPH", "NWP", 
         "BHF", "FHB", "BFH", "HBF", "FBH", "HFB", "BQR", "QRB", "BRQ", 
         "RBQ", "QBR", "RQB", "BRK", "KRB", "RBK", "BKR", "RKB", "KBR", 
         "WDP", "DPW", "DWP", "WPD", "PDW", "PWD")
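
The same idea should also cover strings with repeated letters, such as the four-letter example in the question, because sorting keeps duplicates. A small sketch, where letter_key() is a hypothetical helper name (not part of the answer above) that wraps the sorting step:

```r
# letter_key() is a hypothetical helper: it returns each string's letters
# in alphabetical order, keeping repeated letters
letter_key <- function(x) {
  vapply(strsplit(x, ""),
         function(ch) paste(sort(ch), collapse = ""),
         character(1))
}

# Four-letter example from the question
four <- c("GPWG", "GWGP", "GPGW", "PWGG", "GGPW", "PGWG")

# All six strings reduce to a single key
unique(letter_key(four))
#> [1] "GGPW"
```

Because duplicates are kept, "CFF" and "CCF" would get different keys ("CFF" and "CCF") and so would not be grouped together.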