Writing a mutate/case_when function in R with a variable number of cases to check

Time:10-27

I am very new to R, so please bear with me. I've spent a good few hours trying to figure this out, so I figured it's about time to ask for help.

I am currently trying to clean up a dataset, part of which requires me to group answers from one column into a new column.

For example, when "france" is matched in a "country" column, it would add "europe" in a new "continent" column.

Basically this:

Reproducible code

library(dplyr) # provides %>%, mutate, across and case_when

# a handful of countries to sort
df <- data.frame(country = c('france','england','usa','poland','brazil','kenya','canada','england', 'usa', 'france'))

# simplified vectors for each continent
europe <- c('england','france','poland')
north_america <- c('usa','canada')
south_america <- c('brazil')
africa <- c('kenya')

# the grouping
df_updated <- df %>%
  mutate(across(country, ~ case_when(. %in% europe ~ 'europe',
                                   . %in% north_america ~ 'north america',
                                   . %in% south_america ~ 'south america',
                                   . %in% africa ~ 'africa'),.names = 'region'))

This works great. However, I have to do this type of grouping across dozens of different categories in many datasets. I know it's not good practice to copy and paste huge chunks of code, so I am trying to write a function to do this instead.

So I have added the following:

The function

country_list <- list(europe, north_america, south_america, africa) # a list of the 4 region vectors
country_cat <- c('europe', 'north america', 'south america', 'africa') # a vector of corresponding labels for the categories

grouping_func <- function(dataframe, name, data, list, category) {
  dataframe %>%
    mutate(across(!!sym(data), ~ case_when(. %in% list[[1]] ~ category[1],
                                           . %in% list[[2]] ~ category[2],
                                           . %in% list[[3]] ~ category[3],
                                           . %in% list[[4]] ~ category[4]), .names = '{name}'))
}

df_updated2 <- grouping_func(df, 'continent', 'country', country_list, country_cat)

This took a bit of playing around (for instance, realising that I couldn't search through a vector of vectors), but it works great.

The problem

This brings me to my issue. Not all of the variables I want to categorise will have the same number of categories.

For example, there are 7 continents but only 4 US regions, or 12 time zones, or 10 colors of fruit, or whatever else I need to categorize.

This means I need to find a way of iterating through my list and categories based on the length of the list.

For example, if I passed the following into my function, it would break, because the function is currently hard-coded to work through lists of exactly 4 categories:
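To make "iterating based on the length of the list" concrete, here is a minimal sketch in base R. It swaps case_when for a plain for loop over seq_along(), so it is not the original mutate/across approach. The function name group_by_loop, its argument names, and the small demo data frame are all hypothetical and only there for illustration.

```r
# Hypothetical helper: names and arguments are illustrative, not from the original code.
group_by_loop <- function(dataframe, name, data, groups, labels) {
  stopifnot(length(groups) == length(labels))
  values <- dataframe[[data]]
  # Start with NA so values that match no group stay NA, as with case_when.
  result <- rep(NA_character_, length(values))
  # Walk the groups in reverse so that the first matching group wins,
  # mirroring the top-to-bottom priority of case_when.
  for (i in rev(seq_along(groups))) {
    result[values %in% groups[[i]]] <- labels[i]
  }
  dataframe[[name]] <- result
  dataframe
}

# Demo data (assumed for this sketch); 'japan' is deliberately in no group.
demo_df <- data.frame(country = c('france', 'usa', 'kenya', 'japan'))
demo_groups <- list(c('england', 'france', 'poland'), c('usa', 'canada'), c('kenya'))
demo_labels <- c('europe', 'north america', 'africa')

demo_updated <- group_by_loop(demo_df, 'continent', 'country', demo_groups, demo_labels)
```

Because the loop runs over seq_along(groups), the same function should handle 3, 4, or 12 categories without any change.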

morning <- c(0:11)
afternoon <- c(12:18)
evening <- c(19:23)
time_list <- list(morning, afternoon, evening)
time_cat <- c('morning', 'afternoon', 'evening')
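One possible way to handle lists of any length, such as the three time-of-day groups above, is to flatten the list into a named lookup vector instead of writing one case_when line per category. The sketch below uses base R only. The function name group_by_lookup, its argument names, and the hours_df data frame are assumptions made for illustration, and the values are compared as character strings, which is a simplification.

```r
# Hypothetical helper: names and arguments are illustrative, not from the original code.
group_by_lookup <- function(dataframe, name, data, groups, labels) {
  stopifnot(length(groups) == length(labels))
  # Repeat each label once per member of its group, then name each label
  # by the value it belongs to, e.g. c("0" = "morning", ..., "23" = "evening").
  lookup <- setNames(rep(labels, lengths(groups)), as.character(unlist(groups)))
  # Index the lookup by the column's values; unmatched values become NA.
  dataframe[[name]] <- unname(lookup[as.character(dataframe[[data]])])
  dataframe
}

# Same groups as in the example above, redefined so the sketch is self-contained.
time_list <- list(0:11, 12:18, 19:23)
time_cat <- c('morning', 'afternoon', 'evening')

# Demo data (assumed for this sketch).
hours_df <- data.frame(hour = c(3, 12, 18, 19, 23))
hours_updated <- group_by_lookup(hours_df, 'part_of_day', 'hour', time_list, time_cat)
```

The number of categories never appears in the function body, so the same call pattern should work for country_list and country_cat as well.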

I've tried using a for loop in various ways, and I've also tried to figure out how lapply might help, but with both I've hit a brick wall. To be honest, I don't even know if I got particularly close. I've read everything I can find on Google and SO using every keyword I can think of, but I'm coming up with nothing, and part of me wonders if my lack of experience means I don't even know what I should be looking for.

Could someone give me a pointer on what I'm looking for and how best to proceed with this? I'm really keen to learn, but I'm about 4 hours into this problem now and no further forward than when I started.
