I'm wondering if there's a nicer way to solve the following problem.
I have a dataframe with the following example structure:
Split_key | label | sub_label |
---|---|---|
A_B_C | 7 | "" |
A_B_C | 7 | "" |
A_B_C | 8 | "" |
A_B_C | 8 | "" |
A_B_C | 10 | "" |
A_B_C | 10 | "" |
D_E_F | 2 | "" |
D_E_F | 7 | "" |
D_E_F | 15 | "" |
G_H_I | 1 | "" |
G_H_I | 2 | "" |
G_H_I | 3 | "" |
I wish to populate sub_label by splitting the value in Split_key on the "_" character and picking one of the resulting elements based on label. The element to pick is given by the position of the row's label in the sorted array of unique labels that share the same value in Split_key.
The correct end result is shown here.
Split_key | label | sub_label |
---|---|---|
A_B_C | 7 | A |
A_B_C | 7 | A |
A_B_C | 8 | B |
A_B_C | 8 | B |
A_B_C | 10 | C |
A_B_C | 10 | C |
D_E_F | 2 | D |
D_E_F | 7 | E |
D_E_F | 15 | F |
G_H_I | 1 | G |
G_H_I | 2 | H |
G_H_I | 3 | I |
Here is my initial attempt.
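library(dplyr)
library(stringr)
for (row_n in seq_len(nrow(df))) {
  # rows that share this row's Split_key
  duplicates <- df %>% filter(Split_key == df[row_n, "Split_key"][[1]])
  # position of this row's label among the sorted unique labels of that group
  shift <- which(sort(unique(duplicates$label)) == df[row_n, "label"][[1]])
  df[row_n, "sub_label"] <- str_split(df[row_n, "Split_key"], "_")[[1]][shift]
}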
This solution works but is slower than I'd like with large dataframes. Is there a way to accomplish this task without using a for loop?
CodePudding user response:
We may use the factor route, i.e. after grouping by 'Split_key', scan the first element of 'Split_key' and use the integer-converted factor column 'label' as the index
library(dplyr)
df %>%
  group_by(Split_key) %>%
  mutate(sub_label = scan(text = first(Split_key), what = "",
      sep = "_", quiet = TRUE)[as.integer(factor(label))]) %>%
  ungroup
-output
# A tibble: 12 × 3
Split_key label sub_label
<chr> <int> <chr>
1 A_B_C 7 A
2 A_B_C 7 A
3 A_B_C 8 B
4 A_B_C 8 B
5 A_B_C 10 C
6 A_B_C 10 C
7 D_E_F 2 D
8 D_E_F 7 E
9 D_E_F 15 F
10 G_H_I 1 G
11 G_H_I 2 H
12 G_H_I 3 I
data
df <- structure(list(Split_key = c("A_B_C", "A_B_C", "A_B_C", "A_B_C",
"A_B_C", "A_B_C", "D_E_F", "D_E_F", "D_E_F", "G_H_I", "G_H_I",
"G_H_I"), label = c(7L, 7L, 8L, 8L, 10L, 10L, 2L, 7L, 15L, 1L,
2L, 3L), sub_label = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), class = "data.frame", row.names = c(NA, -12L))
CodePudding user response:
Here is an alternative way to achieve your goal. Logic:
- Group by Split_key
- Create a grouping variable for label using the rleid function from data.table
- Use some stringr functions to pull out the matching character:
library(dplyr)
library(stringr)
library(data.table)
df %>%
  group_by(Split_key) %>%
  mutate(group = rleid(label)) %>%
  mutate(sub_label = str_sub(str_replace_all(Split_key, "[^[:alnum:]]", ""), group, group),
         .keep = "unused")
Split_key label sub_label
<chr> <int> <chr>
1 A_B_C 7 A
2 A_B_C 7 A
3 A_B_C 8 B
4 A_B_C 8 B
5 A_B_C 10 C
6 A_B_C 10 C
7 D_E_F 2 D
8 D_E_F 7 E
9 D_E_F 15 F
10 G_H_I 1 G
11 G_H_I 2 H
12 G_H_I 3 I
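One thing worth checking before relying on run-length ids: they number consecutive runs, which matches the "position among sorted unique labels" rule only when each group is already sorted by label. The sketch below illustrates this with a base R stand-in for rleid; run_id and rank_id are hypothetical helper names, and the unsorted vector is made-up illustration data.

```r
# Base R stand-in for a run-length id on a single vector (hypothetical helper)
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Position of each value among the sorted unique values (hypothetical helper)
rank_id <- function(x) match(x, sort(unique(x)))

sorted_label   <- c(7L, 7L, 8L, 8L, 10L, 10L)  # first group of the example
unsorted_label <- c(8L, 7L, 7L, 10L, 8L, 10L)  # same values, shuffled (illustrative)

print(run_id(sorted_label))     # 1 1 2 2 3 3
print(rank_id(sorted_label))    # 1 1 2 2 3 3
print(run_id(unsorted_label))   # 1 2 2 3 4 5
print(rank_id(unsorted_label))  # 2 1 1 3 2 3
```

The str_sub step also takes one character per position, so it appears to assume that every element of Split_key is a single character.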
CodePudding user response:
This will be faster than a loop:
library(dplyr)
dat %>%
  group_by(Split_key) %>%
  mutate(sub_label2 = strsplit(Split_key[1], "_")[[1]][ match(label, sort(unique(label))) ]) %>%
  ungroup()
# # A tibble: 12 x 4
# Split_key label sub_label sub_label2
# <chr> <int> <chr> <chr>
# 1 A_B_C 7 A A
# 2 A_B_C 7 A A
# 3 A_B_C 8 B B
# 4 A_B_C 8 B B
# 5 A_B_C 10 C C
# 6 A_B_C 10 C C
# 7 D_E_F 2 D D
# 8 D_E_F 7 E E
# 9 D_E_F 15 F F
# 10 G_H_I 1 G G
# 11 G_H_I 2 H H
# 12 G_H_I 3 I I
If there are fewer elements encoded within Split_key than there are distinct values of label in that group, then you will get NA for the rows whose label has no matching element.
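A small sketch of that NA case, using a made-up key "A_B" with three distinct labels (these values are illustrative and not from the question's data):

```r
# Two encoded elements but three distinct labels (illustrative values)
key <- "A_B"
label <- c(1L, 2L, 3L)

parts <- strsplit(key, "_", fixed = TRUE)[[1]]
idx <- match(label, sort(unique(label)))

# Indexing past the end of a vector yields NA rather than an error
sub_label <- parts[idx]
print(sub_label)  # "A" "B" NA
```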
Walk-through:
- group_by(Split_key): since we need to track the unique label values for each Split_key, we group on this field, which simplifies processing to one group at a time;
- strsplit(Split_key[1], "_")[[1]]: within a particular group, we only need one of the Split_key values to split, not all of them (since they are all identical); this results internally in a vector such as c("A", "B", "C") in the first group;
- match(label, sort(unique(label))) translates (first group) to match(c(7,7,8,8,10,10), c(7,8,10)), which translates to c(1,1,2,2,3,3); this is used to index on the vector c("A","B","C") from the previous bullet.
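For anyone who prefers to avoid the dplyr dependency, the same match logic can be sketched in base R. This is one possible translation under the assumptions of the example data, not a benchmarked replacement. The data frame is rebuilt inline so the block runs on its own.

```r
# Example data from the question, rebuilt without the empty sub_label column
df <- data.frame(
  Split_key = rep(c("A_B_C", "D_E_F", "G_H_I"), times = c(6, 3, 3)),
  label = c(7L, 7L, 8L, 8L, 10L, 10L, 2L, 7L, 15L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Position of each label among the sorted unique labels of its Split_key group
idx <- ave(df$label, df$Split_key,
           FUN = function(x) match(x, sort(unique(x))))

# Split every key on "_", then pick the idx-th piece row by row
parts <- strsplit(df$Split_key, "_", fixed = TRUE)
df$sub_label <- mapply(function(p, i) p[i], parts, idx, USE.NAMES = FALSE)

print(df)
```

Here ave() plays the role of group_by() plus mutate(): it applies the function within each Split_key group and returns the results in the original row order.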