Save unique values of variable for each combination of two variables in a dataset-CodePudding

I have a (large) dataset with three variables. For each combination of sub1 and sub2, I would like to save a all unique IVs in a separate vector or dataset, ignoring id, and name it using the variables "sub1.and.sub2.IV". As my dataset is quite large, I would like to avoid using which and automatically extract all combinations.

  id    sub1  sub2  IV   
  <chr> <chr> <chr> <chr>
1 3     a     a     p    
2 3     a     a     f    
3 6     a     b     z    
4 6     a     b     e    
5 7     a     c     b    
6 7     a     c     b

In the end, I would have three vector or datasets:

> a.and.a.IV
[1] "p" "f"
> a.and.b.IV
[1] "z" "e"
> a.and.c.IV
[1] "b"

MRE example:

structure(list(id = c("3", "3", "6", "6", "7", "7"), sub1 = c("a", 
"a", "a", "a", "a", "a"), sub2 = c("a", "a", "b", "b", "c", "c"
), IV = c("p", "f", "z", "e", "b", "b")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

CodePudding user response：

Maybe split

> split(df$IV, df[c("sub1","sub2")])
$a.a
[1] "p" "f"

$a.b
[1] "z" "e"

$a.c
[1] "b" "b"

CodePudding user response：

One possibility could be::

a.and.a.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="a"),]$IV)
a.and.b.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="b"),]$IV)
a.and.c.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="c"),]$IV)

> a.and.a.IV
[1] "p" "f"
> a.and.b.IV
[1] "z" "e"
> a.and.c.IV
[1] "b"

CodePudding user response：

I used @ThomasIsCoding's comment to search for more solutions. I have found 3 solutions to split the dataframe into a list of tibbles and 1 solution using a loop to split a list into dataframes. The for loop stays the same for every solution:

Solution 1: Using a custom made function by @romainfrancois to split and name the data.frames with the corresponding combinations of sub1 and sub2.

library(dplyr, warn.conflicts = FALSE)

named_group_split <- function(.tbl, ...) {
  grouped <- group_by(.tbl, ...)
  names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = " / ")))

  grouped %>% 
    group_split() %>% 
    rlang::set_names(names)
}

df_split1 <- df %>% 
  named_group_split(sub1, sub2) %>%
  unique()

for(i in 1:length(df_split1)) {
  assign(paste0(names(df_split1[i])), as.data.frame(df_split1[[i]]))
}

Solution 2: Using dplyr::group_split to split the dataset into a list with all the original variables and their respective names. Unfortunately, this solution is not able to name the data.frames. Solution found here.

df_split2 <- df %>%
  group_split(sub1, sub2)

for(i in 1:length(df_split2)) {
  assign(paste0(names(df_split2[i])), as.data.frame(df_split2[[i]]))
}

Solution 3: Using base::split allows to split the dataset into a list with just IVs as variable and the for loop.

df_split3 <- split(df$IV, df[c("sub1","sub2")])

for(i in 1:length(df_split3)) {
  assign(paste0(names(df_split3[i])), as.data.frame(df_split3[[i]]))
}