R - swap the sequences before and after the underscore to get consistent strings in a data frame

I have a data frame in R with around 6000 different strings (column S). Sometimes the parts before and after the underscore are swapped, so the same pair shows up in both orders. Here is an example:

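# Example data: the same pair can appear in either order (e.g. A_B and B_A)
df <- data.frame(val = sample(1:100, 50, replace = TRUE),
                 S = c(paste0(rep('A', 5), '_', rep('B', 5)),
                       paste0(rep('C', 10), '_', rep('D', 10)),
                       paste0(rep('B', 3), '_', rep('A', 3)),
                       paste0(rep('C', 7), '_', rep('A', 7)),
                       paste0(rep('E', 20), '_', rep('F', 20)),
                       paste0(rep('F', 5), '_', rep('G', 5))),
                 stringsAsFactors = FALSE)  # keep S as character so strsplit() works in R < 4.0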

I need the strings before and after the underscore to always appear in the same order, because both variants denote the same thing (which order is used doesn't matter, as long as it is consistent). I am losing my mind over how to do this.

I tried splitting the string S and storing the two parts in separate columns:

df$L1 <- unlist(lapply(strsplit(df$S, '_'), function(x) x[1]))
df$L2 <- unlist(lapply(strsplit(df$S, '_'), function(x) x[2]))

I was thinking of using a for loop, but I don't know how to check all the different combinations and store the information.

Could you help me?

CodePudding user response：

If the order doesn't matter, sort() might be helpful here: sorting the two parts of each string puts every pair in the same (alphabetical) order.

cbind(df, S_new = sapply(strsplit(df$S, "_"), function(x) 
  paste(sort(x), collapse="_")))
   val   S S_new
1   44 A_B   A_B
2   84 A_B   A_B
3   78 A_B   A_B
4   95 A_B   A_B
5   87 A_B   A_B
6   70 C_D   C_D
7   34 C_D   C_D
8   55 C_D   C_D
9   94 C_D   C_D
10  94 C_D   C_D
11   7 C_D   C_D
12  14 C_D   C_D
13  60 C_D   C_D
14  58 C_D   C_D
15  37 C_D   C_D
16   8 B_A   A_B
17  31 B_A   A_B
18  64 B_A   A_B
...
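
As a quick check of the idea above, here is a minimal sketch on a small toy data frame (the name `toy` and its values are made up for illustration). It uses `vapply()` instead of `sapply()`, which is my substitution so that the result is guaranteed to be a character vector, and `fixed = TRUE` so the underscore is treated as a literal separator.

```r
# Toy data (hypothetical): A_B/B_A and C_A/A_C are the same pairs in swapped order
toy <- data.frame(S = c("A_B", "B_A", "C_D", "C_A", "A_C"),
                  stringsAsFactors = FALSE)

# Split each string at the underscore, sort the parts, and glue them back together
parts <- strsplit(toy$S, "_", fixed = TRUE)
toy$S_new <- vapply(parts,
                    function(x) paste(sort(x), collapse = "_"),
                    character(1))

# Swapped pairs now collapse to a single spelling
unique(toy$S)      # five distinct strings
unique(toy$S_new)  # three distinct strings
```

Comparing `unique()` before and after is a simple way to confirm on the real 6000 strings that swapped pairs were merged.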

CodePudding user response：

For this we could use separate() from tidyr, which splits S into the two columns L1 and L2 (the parts are only split here, not reordered):

library(dplyr)
library(tidyr)
df %>% 
  separate(S, into=c("L1", "L2"), sep = "_", remove = FALSE)

   val   S L1 L2
1   40 A_B  A  B
2   64 A_B  A  B
3   49 A_B  A  B
4   71 A_B  A  B
5   98 A_B  A  B
6    1 C_D  C  D
7   53 C_D  C  D
8   99 C_D  C  D
9   13 C_D  C  D
10  70 C_D  C  D
11  82 C_D  C  D
12  49 C_D  C  D
13  42 C_D  C  D
14  44 C_D  C  D
15  57 C_D  C  D
16  57 B_A  B  A
17  92 B_A  B  A
18  83 B_A  B  A
19  43 C_A  C  A
20  45 C_A  C  A
21  70 C_A  C  A
22  97 C_A  C  A
23  28 C_A  C  A
24  18 C_A  C  A
25  18 C_A  C  A
26  54 E_F  E  F
27  19 E_F  E  F
28  70 E_F  E  F
29  43 E_F  E  F
30  69 E_F  E  F
31  50 E_F  E  F
32  30 E_F  E  F
33  36 E_F  E  F
34  96 E_F  E  F
35  48 E_F  E  F
36  39 E_F  E  F
37  77 E_F  E  F
38  91 E_F  E  F
39  83 E_F  E  F
40  95 E_F  E  F
41  33 E_F  E  F
42  74 E_F  E  F
43  27 E_F  E  F
44   8 E_F  E  F
45  19 E_F  E  F
46  34 F_G  F  G
47  75 F_G  F  G
48  67 F_G  F  G
49  17 F_G  F  G
50   4 F_G  F  G
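
The output above still contains both A_B and B_A, so one more step is needed to get a consistent order. The following is a sketch of one possible way to finish it, not part of the original answer: after splitting, base R's `pmin()` and `pmax()` pick the alphabetically smaller and larger part of each row. The data frame `toy` and the column name `S_new` are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical) with one swapped pair: A_B and B_A
toy <- data.frame(val = 1:4,
                  S = c("A_B", "B_A", "C_A", "F_G"),
                  stringsAsFactors = FALSE)

out <- toy %>%
  # split S into its two parts, keeping the original column
  separate(S, into = c("L1", "L2"), sep = "_", remove = FALSE) %>%
  # rebuild the string with the smaller part first, row by row
  mutate(S_new = paste(pmin(L1, L2), pmax(L1, L2), sep = "_"))

out$S_new
# "A_B" "A_B" "A_C" "F_G"
```

Because `pmin()` and `pmax()` are vectorised, this avoids a for loop and an explicit per-row `sort()`, but it only handles strings with exactly two parts.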