I am working with a dataset that looks like this...
group col_2 col_3 col_4
A TT 12 21
A RR 11 21
A LL 13 22
A QQ 11 24
A PP 14 25
A RR 15 26
A TT 17 28
A LL 16 29
B DD 12 23
B QQ 14 23
B PP 13 25
B HH 11 25
B LL 15 26
B DD 17 28
B QQ 14 29
B HH 13 30
C MM 18 21
C JJ 15 22
C LL 17 23
C NN 14 24
C EE 19 25
C KK 15 28
C NN 17 28
C UU 10 29
D II 14 21
D OO 15 23
D PP 16 24
D LL 17 25
D MM 18 26
D AA 10 28
D HH 12 29
D JJ 13 30
So, I need to create a new data frame grouping by the group column and using the values of col_4.
As you can see range of values in col_4 goes from 21-30. I need to select three values for each group. One value is supposed to be on the range from 21-22, the second value is supposed to be on the range from 25-26 and the third values needs to be a value on the range from 29-30. If there are two possibilities I need to randomly select one values and if there are no possibilities I need the output to be NA. For example, you can see that group A has three possible outputs for the first range with values in col_4 of 21, 21, 22. Then I need to select just one, randomly. You can also see that group B doe not has any values in column_4 between 21-22, so I need the output to be NA.
The second important issue is that I need to have only three rows per group. So I want my data (the output) to look like this.
group col_2 col_3 col_4 range_2122 group col_2 col_3 col_4 range_2526 group col_2 col_3 col_4 range2930
A TT 12 21 21 A RR 15 26 26 A LL 16 29 29
B NA NA NA NA B HH 11 25 25 B HH 13 30 30
C etc.
D etc.
A second option is to get an output like this...
group col_2 col_3 col_4 range
A TT 12 21 21
A RR 15 26 26
A LL 16 29 29
B NA NA NA NA
B HH 11 25 25
B HH 13 30 30
C etc..
CodePudding user response:
The below comes fairly close to what you want using dplyr
and tidyr
. As has been mentioned, the example output you gave has non-unique names which are not a good idea.
Hopefully you can use this as a starting point.
library(dplyr)
df %>%
dplyr::mutate(grouping_col = case_when(
col_4 %in% 21:22 ~ "range_2122",
col_4 %in% 25:26 ~ "range_2526",
col_4 %in% 29:30 ~ "range_2930",
TRUE ~ NA_character_
)) %>%
# Get all combinations of group and range (ensures NAs where missing)
right_join(tidyr::expand(., group, grouping_col), by = c("grouping_col", "group")) %>%
filter(!is.na(grouping_col)) %>%
# Group back and randomly select a row
group_by(group, grouping_col) %>%
slice_sample(n = 1)
CodePudding user response:
Here's a dplyr suggestion:
library(dplyr)
# library(purrr) # map_dfc
set.seed(42)
dat %>%
group_by(group) %>%
summarize(purrr::map_dfc(L, function(z) {
out <- filter(cur_data(), between(col_4, z[1], z[2])) %>%
slice(sample(n(), 1))
if (!nrow(out)) out <- out[NA,]
out[[ paste0("range_", paste(z, collapse = "")) ]] <- out$col_4
out
})) %>%
ungroup()
# New names:
# * col_2 -> col_2...1
# * col_3 -> col_3...2
# * col_4 -> col_4...3
# * col_2 -> col_2...5
# * col_3 -> col_3...6
# * ...
# New names:
# * col_2 -> col_2...1
# * col_3 -> col_3...2
# * col_4 -> col_4...3
# * col_2 -> col_2...5
# * col_3 -> col_3...6
# * ...
# New names:
# * col_2 -> col_2...1
# * col_3 -> col_3...2
# * col_4 -> col_4...3
# * col_2 -> col_2...5
# * col_3 -> col_3...6
# * ...
# New names:
# * col_2 -> col_2...1
# * col_3 -> col_3...2
# * col_4 -> col_4...3
# * col_2 -> col_2...5
# * col_3 -> col_3...6
# * ...
# # A tibble: 3 x 13
# group col_2...1 col_3...2 col_4...3 range_2122 col_2...5 col_3...6 col_4...7 range_2526 col_2...9 col_3...10 col_4...11 range_2930
# <chr> <chr> <int> <int> <int> <chr> <int> <int> <int> <chr> <int> <int> <int>
# 1 A TT 12 21 21 PP 14 25 25 LL 16 29 29
# 2 C JJ 15 22 22 EE 19 25 25 UU 10 29 29
# 3 D II 14 21 21 LL 17 25 25 JJ 13 30 30