How to ensure the setNames() attributes the correct names to my group_split() datasets when splittin

Time:02-19

As a follow-up to this question, I'm using dplyr's group_split() to make dataframes / tibbles based on the levels of a column. This time I want to split on two columns instead of one. When I split and then name the resulting datasets, the wrong names get attributed to some of them.

Here's a simple example:

library(dplyr)
#Sample dataset to intuitively illustrate issue
example <- tibble(number = c(1:6),
                  even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
                  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
  mutate(type = paste0(even_or_odd, "_", prime_or_not)) %>%
  mutate(type_factor = factor(type, levels = unique(type)))

#Does group split to make 3 datasets
the_test <- example %>%
  group_split(even_or_odd, prime_or_not) %>%
  setNames(unique(example$type_factor))

#The data sets with some being correct but others not
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #wrong label :`-(
odd_prime <- the_test["odd_prime"]$odd_prime #wrong label :`-(
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!

My question: how do I ensure that my group names will be attributed to the right dataset and avoid the issues here with even_not and odd_prime being mixed up?

In my actual dataset, I have 50 combinations, so typing them all out manually is not an option. In addition, my actual dataset will have some combinations that don't consistently exist (like the odd, not prime combination here), so relying on index isn't an option.

CodePudding user response:

Instead of splitting by the two columns, split by the factor column you already created. group_split() then returns the groups in the order of the levels of type_factor. Naming with unique() on type_factor is also fragile: unique() returns values in order of first appearance, which matches the group order only when the levels happen to be defined in that same order. levels() is the safer choice. It may also be appropriate to apply droplevels() first, in case there are unused levels.

the_test <- example %>%
  group_split(type_factor) %>% 
  setNames(levels(example$type_factor))
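The difference between unique() and levels() only shows up when the level order differs from the order of first appearance. Here is a minimal sketch of that case, assuming a hypothetical column type_sorted (not in the question) whose levels are alphabetical and include the unused level odd_not:

```r
library(dplyr)

# Same sample data as in the question
example <- tibble(number = 1:6,
                  even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
                  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
  mutate(type = paste0(even_or_odd, "_", prime_or_not),
         # Hypothetical column: alphabetical levels plus one unused level
         type_sorted = factor(type,
                              levels = c("even_not", "even_prime",
                                         "odd_not", "odd_prime")))

# unique() follows first appearance; levels() follows the level order
appearance_order <- as.character(unique(example$type_sorted))
level_order <- levels(droplevels(example$type_sorted))

# group_split() orders the groups by level and, by default, drops empty
# groups, so the names have to come from the used levels only
the_test <- example %>%
  group_split(type_sorted) %>%
  setNames(level_order)
```

Here appearance_order starts with odd_prime while the first group returned is even_not, so naming with unique() would mislabel the pieces. Without droplevels(), levels() would return four names for three groups.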

group_split() returns an unnamed list. To avoid the risk of attaching names incorrectly, use split() from base R, which returns a named list. The elements can then come back in any order, because each name stays attached to its own data.

# 1 - split on both columns; groups come back in alphabetical order,
#     with names joined by "." (e.g. even.not)
split(example, example[c("even_or_odd", "prime_or_not")], drop = TRUE)

# 2 - return order based on the levels of the factor column
split(example, example$type_factor)

# 3 - with the dplyr pipe
example %>%
  split(.$type_factor)

# 4 - or using the magrittr exposition operator
library(magrittr)
example %$%
  split(x = ., f = type_factor)
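If you would rather stay with group_split() on the two original columns, another option (not part of the answer above) is to build the names from group_keys(), which returns one row per group in the same order as group_split(). A sketch under that assumption, where group_names and by_keys are illustrative names:

```r
library(dplyr)

# Same sample data as in the question, without the helper columns
example <- tibble(number = 1:6,
                  even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
                  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not"))

grouped <- example %>%
  group_by(even_or_odd, prime_or_not)

# One name per group, taken from the same grouping that will be split
group_names <- grouped %>%
  group_keys() %>%
  mutate(name = paste0(even_or_odd, "_", prime_or_not)) %>%
  pull(name)

by_keys <- grouped %>%
  group_split() %>%
  setNames(group_names)
```

Because the names and the pieces come from the same grouped tibble, combinations that do not exist in the data (such as odd_not) simply never get a name.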

CodePudding user response:

Oh, of course the moment I post it, I realize that an easy solution existed:

Just change the group split to the new variable and it works!

library(dplyr)

#Does group split to make 3 datasets
the_test <- example %>%
  group_split(type_factor) %>%
  setNames(unique(example$type_factor))

#The data sets, now all labelled correctly
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #works now!
odd_prime <- the_test["odd_prime"]$odd_prime #works now!
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
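With 50 combinations it is hard to confirm the labels by eye. A small sanity check, sketched here on the sample data, compares each element's name with the type values it contains. The name labels_ok is illustrative, and as.character() is added only to make the factor-to-name conversion explicit:

```r
library(dplyr)

# Same sample data as in the question
example <- tibble(number = 1:6,
                  even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
                  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
  mutate(type = paste0(even_or_odd, "_", prime_or_not),
         type_factor = factor(type, levels = unique(type)))

the_test <- example %>%
  group_split(type_factor) %>%
  setNames(as.character(unique(example$type_factor)))

# TRUE for an element only if every row in it carries that element's name
labels_ok <- vapply(names(the_test),
                    function(nm) all(the_test[[nm]]$type == nm),
                    logical(1))

# Stops with an error if any dataset is mislabelled
stopifnot(all(labels_ok))
```

The same check run on the original two-column split from the question would fail, which makes it a cheap guard to keep next to the setNames() call.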
