I'm pretty new to R, so maybe this question is simple.
I have a character vector with possible combinations of letters. For example:
[1] "YMC" "YCM" "MYC" "CMY" "CYM" "MCY" "MEH" "HEM" "EMH" "MHE" "EHM" "HME"
[13] "CFF" "FCF" "FFC" "AYY" "YFS" "YYA" "SFY" "YSF" "FSY" "SYF" "YAY" "FYS"
[25] "HYP" "HPY" "WNP" "PWN" "PHY" "PNW" "YHP" "PYH" "WPN" "NPW" "YPH" "NWP"
[37] "BHF" "FHB" "BFH" "HBF" "FBH" "HFB" "BQR" "QRB" "BRQ" "RBQ" "QBR" "RQB"
[49] "BRK" "KRB" "RBK" "BKR" "RKB" "KBR" "WDP" "DPW" "DWP" "WPD" "PDW" "PWD"
I want to know which strings share the same letters (that is, are made from the same letters in a different order).
As you can see, the first 6 strings all come from "C" "Y" "M" and the next 6 from "M" "E" "H".
Likewise, "GPWG" "GWGP" "GPGW" "PWGG" "GGPW" "PGWG" come from "G" "G" "W" "P".
What R code can I use to answer this question in an automated way?
Many thanks for your help
CodePudding user response:
Supposing your character vector is called vec, you can do:
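# Sort the letters within each string so that anagrams share one key
ordered <- sapply(strsplit(vec, ''), function(x) paste(sort(x), collapse = ''))
# Number each distinct key in order of first appearance
df <- data.frame(string = vec,
                 letters = ordered,
                 group = match(ordered, unique(ordered)))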
This gives you a data frame with a column for the original strings, a column with each string's letters in alphabetical order, and a grouping variable that identifies which strings are made from the same letters:
df
#> string letters group
#> 1 YMC CMY 1
#> 2 YCM CMY 1
#> 3 MYC CMY 1
#> 4 CMY CMY 1
#> 5 CYM CMY 1
#> 6 MCY CMY 1
#> 7 MEH EHM 2
#> 8 HEM EHM 2
#> 9 EMH EHM 2
#> 10 MHE EHM 2
#> 11 EHM EHM 2
#> 12 HME EHM 2
#> 13 CFF CFF 3
#> 14 FCF CFF 3
#> 15 FFC CFF 3
#> 16 AYY AYY 4
#> 17 YFS FSY 5
#> 18 YYA AYY 4
#> 19 SFY FSY 5
#> 20 YSF FSY 5
#> 21 FSY FSY 5
#> 22 SYF FSY 5
#> 23 YAY AYY 4
#> 24 FYS FSY 5
#> 25 HYP HPY 6
#> 26 HPY HPY 6
#> 27 WNP NPW 7
#> 28 PWN NPW 7
#> 29 PHY HPY 6
#> 30 PNW NPW 7
#> 31 YHP HPY 6
#> 32 PYH HPY 6
#> 33 WPN NPW 7
#> 34 NPW NPW 7
#> 35 YPH HPY 6
#> 36 NWP NPW 7
#> 37 BHF BFH 8
#> 38 FHB BFH 8
#> 39 BFH BFH 8
#> 40 HBF BFH 8
#> 41 FBH BFH 8
#> 42 HFB BFH 8
#> 43 BQR BQR 9
#> 44 QRB BQR 9
#> 45 BRQ BQR 9
#> 46 RBQ BQR 9
#> 47 QBR BQR 9
#> 48 RQB BQR 9
#> 49 BRK BKR 10
#> 50 KRB BKR 10
#> 51 RBK BKR 10
#> 52 BKR BKR 10
#> 53 RKB BKR 10
#> 54 KBR BKR 10
#> 55 WDP DPW 11
#> 56 DPW DPW 11
#> 57 DWP DPW 11
#> 58 WPD DPW 11
#> 59 PDW DPW 11
#> 60 PWD DPW 11
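If a list of groups is more convenient than a data frame, the same sorted-letters key can be passed to split(). The following is only a minimal sketch: the short vector and the names key and groups are made up for illustration, and it includes the four-letter strings from the question to show that repeated letters are handled too.

```r
# Small illustrative vector: a subset of the question's data plus the
# four-letter example that contains a repeated letter
vec <- c("YMC", "YCM", "MEH", "HEM", "CFF", "FCF", "GPWG", "GWGP", "PWGG")

# Canonical key: the letters of each string in sorted order
key <- vapply(strsplit(vec, ""),
              function(x) paste(sort(x), collapse = ""),
              character(1))

# Named list with one element per key, holding the original strings
groups <- split(vec, key)
groups

# Number of strings in each group
lengths(groups)
```

Because the key keeps every letter, including duplicates, "CFF" and "FCF" end up together while a string such as "CCF" would form its own group.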
Data from question in reproducible format
vec <- c("YMC", "YCM", "MYC", "CMY", "CYM", "MCY", "MEH", "HEM", "EMH",
"MHE", "EHM", "HME", "CFF", "FCF", "FFC", "AYY", "YFS", "YYA",
"SFY", "YSF", "FSY", "SYF", "YAY", "FYS", "HYP", "HPY", "WNP",
"PWN", "PHY", "PNW", "YHP", "PYH", "WPN", "NPW", "YPH", "NWP",
"BHF", "FHB", "BFH", "HBF", "FBH", "HFB", "BQR", "QRB", "BRQ",
"RBQ", "QBR", "RQB", "BRK", "KRB", "RBK", "BKR", "RKB", "KBR",
"WDP", "DPW", "DWP", "WPD", "PDW", "PWD")