Processing dataframe column via str_split using values in another column (R)

I'm wondering if there's a nicer way to solve the following problem.

I have a dataframe with the following example structure:

Split_key	label	sub_label
A_B_C	7	""
A_B_C	7	""
A_B_C	8	""
A_B_C	8	""
A_B_C	10	""
A_B_C	10	""
D_E_F	2	""
D_E_F	7	""
D_E_F	15	""
G_H_I	1	""
G_H_I	2	""
G_H_I	3	""

I wish to populate sub_label by splitting the value in Split_key on the "_" character and picking the element whose position is determined by label. That position is the index of the row's label in the sorted vector of unique labels that share the same value in Split_key.
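To make the rule concrete, here is a minimal base R sketch for a single group (the first one in the example). The variable names `key`, `labels`, `parts`, and `idx` are chosen for illustration only and are not part of the original data.

```r
# One group from the example: every row shares the same Split_key
key    <- "A_B_C"
labels <- c(7, 7, 8, 8, 10, 10)

# Elements encoded in the key
parts <- strsplit(key, "_", fixed = TRUE)[[1]]

# Position of each label among the sorted unique labels of the group
idx <- match(labels, sort(unique(labels)))

# Pick the element at that position for every row
sub_label <- parts[idx]
print(sub_label)
# [1] "A" "A" "B" "B" "C" "C"
```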

The correct end result is shown here.

Split_key	label	sub_label
A_B_C	7	A
A_B_C	7	A
A_B_C	8	B
A_B_C	8	B
A_B_C	10	C
A_B_C	10	C
D_E_F	2	D
D_E_F	7	E
D_E_F	15	F
G_H_I	1	G
G_H_I	2	H
G_H_I	3	I

Here is my initial attempt.

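library(dplyr)
library(stringr)
for (row_n in seq_len(nrow(df))) {
  # rows that share this row's Split_key
  duplicates <- df %>% filter(Split_key == df[row_n, "Split_key"][[1]])
  # position of this row's label among the group's sorted unique labels
  shift <- which(sort(unique(duplicates$label)) == df[row_n, "label"][[1]])
  df[row_n, "sub_label"] <- str_split(df[row_n, "Split_key"], "_")[[1]][shift]
}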

This solution works but is slower than I'd like with large dataframes. Is there a way to accomplish this task without using a for loop?

CodePudding user response：

We may use the factor route: after grouping by 'Split_key', split the first element of 'Split_key' with scan() and index the result with the 'label' column converted to a factor and then to integer.

library(dplyr)
df %>%
    group_by(Split_key) %>% 
    mutate(sub_label = scan(text = first(Split_key), what = "", 
      sep="_", quiet = TRUE)[as.integer(factor(label))]) %>%
    ungroup()

-output

# A tibble: 12 × 3
   Split_key label sub_label
   <chr>     <int> <chr>    
 1 A_B_C         7 A        
 2 A_B_C         7 A        
 3 A_B_C         8 B        
 4 A_B_C         8 B        
 5 A_B_C        10 C        
 6 A_B_C        10 C        
 7 D_E_F         2 D        
 8 D_E_F         7 E        
 9 D_E_F        15 F        
10 G_H_I         1 G        
11 G_H_I         2 H        
12 G_H_I         3 I

data

df <- structure(list(Split_key = c("A_B_C", "A_B_C", "A_B_C", "A_B_C", 
"A_B_C", "A_B_C", "D_E_F", "D_E_F", "D_E_F", "G_H_I", "G_H_I", 
"G_H_I"), label = c(7L, 7L, 8L, 8L, 10L, 10L, 2L, 7L, 15L, 1L, 
2L, 3L), sub_label = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA)), class = "data.frame", row.names = c(NA, -12L))

CodePudding user response：

Here is an alternative way to achieve your goal. Logic:

group by Split_key
create a grouping variable for label using the rleid function from data.table
use some stringr functions to extract the matching character:

library(dplyr)
library(stringr)
library(data.table)
df %>% 
  group_by(Split_key) %>% 
  mutate(group = rleid(label)) %>%
  mutate(sub_label= str_sub(str_replace_all(Split_key, "[^[:alnum:]]", ""), group, group), .keep="unused")

   Split_key label sub_label
   <chr>     <int> <chr>    
 1 A_B_C         7 A        
 2 A_B_C         7 A        
 3 A_B_C         8 B        
 4 A_B_C         8 B        
 5 A_B_C        10 C        
 6 A_B_C        10 C        
 7 D_E_F         2 D        
 8 D_E_F         7 E        
 9 D_E_F        15 F        
10 G_H_I         1 G        
11 G_H_I         2 H        
12 G_H_I         3 I
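Two assumptions behind this approach are worth checking on real data. Run-length ids number consecutive runs in row order, so they agree with the sorted-unique position only when each group is already sorted by label. Extracting a single character by position also only works when every element of Split_key is one character long. The base R sketch below illustrates both points; `run_id()` is a hypothetical stand-in written for this example, not data.table's implementation of `rleid`, and it assumes a non-empty input.

```r
# Hypothetical helper: id of each consecutive run of equal values
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

sorted_labels   <- c(7, 7, 8, 8, 10, 10)
shuffled_labels <- c(8, 7, 7, 10, 8, 10)

# Sorted input: run ids and sorted-unique positions agree
run_sorted   <- run_id(sorted_labels)                               # 1 1 2 2 3 3
match_sorted <- match(sorted_labels, sort(unique(sorted_labels)))   # 1 1 2 2 3 3

# Shuffled input: they no longer agree
run_shuffled   <- run_id(shuffled_labels)                               # 1 2 2 3 4 5
match_shuffled <- match(shuffled_labels, sort(unique(shuffled_labels))) # 2 1 1 3 2 3

# Multi-character elements: character position differs from element position
key      <- "AA_BB_CC"
stripped <- gsub("[^[:alnum:]]", "", key)           # "AABBCC"
by_char  <- substr(stripped, 2, 2)                   # "A"
by_split <- strsplit(key, "_", fixed = TRUE)[[1]][2] # "BB"
print(c(by_char, by_split))
```

If either assumption does not hold, indexing the split vector with `match(label, sort(unique(label)))` is the safer choice.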

CodePudding user response：

This will be faster than a loop:

library(dplyr)
dat %>%
  group_by(Split_key) %>%
  mutate(sub_label2 = strsplit(Split_key[1], "_")[[1]][ match(label, sort(unique(label))) ]) %>%
  ungroup()
# # A tibble: 12 x 4
#    Split_key label sub_label sub_label2
#    <chr>     <int> <chr>     <chr>     
#  1 A_B_C         7 A         A         
#  2 A_B_C         7 A         A         
#  3 A_B_C         8 B         B         
#  4 A_B_C         8 B         B         
#  5 A_B_C        10 C         C         
#  6 A_B_C        10 C         C         
#  7 D_E_F         2 D         D         
#  8 D_E_F         7 E         E         
#  9 D_E_F        15 F         F         
# 10 G_H_I         1 G         G         
# 11 G_H_I         2 H         H         
# 12 G_H_I         3 I         I

If there are fewer elements encoded within Split_key than there are distinct values of label in that group, then you will get NA for the rows whose position exceeds the number of elements.
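A small sketch of that NA case, using made-up values (a key with two elements and a group with three distinct labels):

```r
# Key encodes only two elements
parts  <- strsplit("A_B", "_", fixed = TRUE)[[1]]
# The group has three distinct labels
labels <- c(1, 2, 3)

idx    <- match(labels, sort(unique(labels)))
result <- parts[idx]   # indexing past the end of a vector yields NA
print(result)
# [1] "A" "B" NA
```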

Walk-through:

group_by(Split_key): since we need to track the unique label values for each Split_key, we group by this field, which simplifies processing to one group at a time;
strsplit(Split_key[1], "_")[[1]]: within a particular group, we only need one of the Split_key values to split, not all of them (since they are all identical); this results internally in a vector such as c("A", "B", "C") in the first group;
match(label, sort(unique(label))) translates (first group) to match(c(7,7,8,8,10,10), c(7,8,10)), which evaluates to c(1,1,2,2,3,3); this is used to index the vector c("A","B","C") from the previous bullet.
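For readers who prefer to avoid extra packages, the same walk-through can be sketched in base R. This is one possible translation, not part of the answer above: it uses `ave()` to compute the per-group position and, for simplicity, splits the key on every row rather than once per group. The data frame is rebuilt here so the block runs on its own.

```r
# Example data from the question (without the empty sub_label column)
df <- data.frame(
  Split_key = rep(c("A_B_C", "D_E_F", "G_H_I"), times = c(6, 3, 3)),
  label     = c(7L, 7L, 8L, 8L, 10L, 10L, 2L, 7L, 15L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Per group: position of each label among the group's sorted unique labels
idx <- ave(df$label, df$Split_key,
           FUN = function(x) match(x, sort(unique(x))))

# Split every key, then pick the element at the computed position
parts <- strsplit(df$Split_key, "_", fixed = TRUE)
df$sub_label <- vapply(seq_along(parts),
                       function(k) parts[[k]][idx[k]],
                       character(1))
print(df$sub_label)
# [1] "A" "A" "B" "B" "C" "C" "D" "E" "F" "G" "H" "I"
```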