I have a data frame in R with around 6000 different strings (column S). Each string has two parts separated by an underscore, and sometimes the two parts are swapped, so the same pair shows up as both X_Y and Y_X. Here is an example:
df <- data.frame(val = sample(1:100, 50, replace = TRUE),
                 S = c(paste0(rep('A', 5), '_', rep('B', 5)),
                       paste0(rep('C', 10), '_', rep('D', 10)),
                       paste0(rep('B', 3), '_', rep('A', 3)),
                       paste0(rep('C', 7), '_', rep('A', 7)),
                       paste0(rep('E', 20), '_', rep('F', 20)),
                       paste0(rep('F', 5), '_', rep('G', 5))),
                 # keep S as character so strsplit() works on R < 4.0 too
                 stringsAsFactors = FALSE)
I need the parts before and after the underscore to always be in the same order, because A_B and B_A are actually the same pair (which order is used doesn't matter, as long as it is consistent). I am losing my mind over how to do it.
I tried splitting the string S and storing the two parts in separate columns:
df$L1 <- unlist(lapply(strsplit(df$S, '_'), function(x) x[1]))
df$L2 <- unlist(lapply(strsplit(df$S, '_'), function(x) x[2]))
I was thinking of using a for loop, but I don't know how to check all the different combinations and store the result.
Could you help me?
CodePudding user response:
If the order doesn't matter, sort() might be helpful here.
cbind(df, S_new = sapply(strsplit(df$S, "_"), function(x)
  paste(sort(x), collapse = "_")))
val S S_new
1 44 A_B A_B
2 84 A_B A_B
3 78 A_B A_B
4 95 A_B A_B
5 87 A_B A_B
6 70 C_D C_D
7 34 C_D C_D
8 55 C_D C_D
9 94 C_D C_D
10 94 C_D C_D
11 7 C_D C_D
12 14 C_D C_D
13 60 C_D C_D
14 58 C_D C_D
15 37 C_D C_D
16 8 B_A A_B
17 31 B_A A_B
18 64 B_A A_B
...
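The same idea can be wrapped in a small reusable function. This is only a sketch: normalise_pair is a hypothetical helper name, not something from base R, and it assumes every string contains the separator exactly as a literal character. vapply() is used instead of sapply() so the result is guaranteed to be a character vector.

```r
# Hypothetical helper (not part of base R): put the parts of "X_Y" in sorted order.
normalise_pair <- function(s, sep = "_") {
  # fixed = TRUE treats sep as a literal string rather than a regex
  parts <- strsplit(as.character(s), sep, fixed = TRUE)
  vapply(parts, function(x) paste(sort(x), collapse = sep), character(1))
}

S <- c("A_B", "B_A", "C_A", "F_G")
S_new <- normalise_pair(S)
print(S_new)
```

With the example data this would be used as df$S_new <- normalise_pair(df$S); swapped pairs such as B_A then share a label with A_B.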
CodePudding user response:
For this we could use separate() from tidyr:
library(dplyr)
library(tidyr)
df %>%
  separate(S, into = c("L1", "L2"), sep = "_", remove = FALSE)
val S L1 L2
1 40 A_B A B
2 64 A_B A B
3 49 A_B A B
4 71 A_B A B
5 98 A_B A B
6 1 C_D C D
7 53 C_D C D
8 99 C_D C D
9 13 C_D C D
10 70 C_D C D
11 82 C_D C D
12 49 C_D C D
13 42 C_D C D
14 44 C_D C D
15 57 C_D C D
16 57 B_A B A
17 92 B_A B A
18 83 B_A B A
19 43 C_A C A
20 45 C_A C A
21 70 C_A C A
22 97 C_A C A
23 28 C_A C A
24 18 C_A C A
25 18 C_A C A
26 54 E_F E F
27 19 E_F E F
28 70 E_F E F
29 43 E_F E F
30 69 E_F E F
31 50 E_F E F
32 30 E_F E F
33 36 E_F E F
34 96 E_F E F
35 48 E_F E F
36 39 E_F E F
37 77 E_F E F
38 91 E_F E F
39 83 E_F E F
40 95 E_F E F
41 33 E_F E F
42 74 E_F E F
43 27 E_F E F
44 8 E_F E F
45 19 E_F E F
46 34 F_G F G
47 75 F_G F G
48 67 F_G F G
49 17 F_G F G
50 4 F_G F G
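Splitting on its own leaves rows such as B_A and C_A in their original order. Once the two parts sit in columns L1 and L2, one possible way to finish is to paste the element-wise smaller and larger part back together with pmin() and pmax(), which also work on character vectors. The sketch below uses base R only; the small df and the S_new column name are made up for illustration, and the split via strsplit() stands in for the separate() call.

```r
# Toy data standing in for the real data frame (illustrative only).
df <- data.frame(S = c("A_B", "B_A", "C_A", "E_F"), stringsAsFactors = FALSE)

# Split into two columns; assumes every string has exactly one underscore.
parts <- do.call(rbind, strsplit(df$S, "_", fixed = TRUE))
df$L1 <- parts[, 1]
df$L2 <- parts[, 2]

# Element-wise: smaller part first, larger part second.
df$S_new <- paste(pmin(df$L1, df$L2), pmax(df$L1, df$L2), sep = "_")
print(df)
```

The same paste(pmin(L1, L2), pmax(L1, L2), sep = "_") expression should also work inside a dplyr mutate() after separate(), since it is vectorised over the two columns.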