I have a dataframe that includes various ages and a value for each age. I would like to classify the different ages into broader age groups. To do so, I had to create a fairly complicated nested ifelse statement:

library(tidyverse)

df <- data.frame(age = c("15 to 17", "18 and 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59"),
                 value = sample(1000:2000,10, replace=TRUE))

new_df = df %>% 
  mutate(age_band = 
           ifelse(age %in% c("15 to 17","18 and 19"), '15 to 19', ifelse(age %in% c("20 to 24","25 to 29"), '20s', ifelse(age %in% c("30 to 34","35 to 39"), '30s','40 '))))

Is there any way to do this without using such complex nested statements? The data goes all the way to ages 85 and over, and it got very complicated to classify each different age group.

CodePudding user response：

Following @Claudiu Papasteri recommendations:

Package

library(dplyr)

Solution

df %>% 
  mutate(age_band = case_when( age %in% c("15 to 17","18 and 19") ~  '15 to 19',
                               age %in% c("20 to 24","25 to 29") ~ '20s',
                               age %in% c("30 to 34","35 to 39") ~ '30s',
                               TRUE ~'40 ')
         )

ps: I would include all specific statements in case_when, even the last one for "40 ", this way you can keep track of any problem in your dataset. Because if something is missing, or you have a typo, everything else will be coded as "40 ". So by stating every case you can catch the problem and fix everything before running an analysis or generating graphs. And the last statement could be changed to TRUE ~ age. This means, whatever remains, use the same value, or TRUE ~ NA, that you recode the remaining stuff for missing. So you know that all missing means you have a problem you have to fix.

Output

         age value age_band
1   15 to 17  1432 15 to 19
2  18 and 19  1112 15 to 19
3   20 to 24  1265      20s
4   25 to 29  1076      20s
5   30 to 34  1212      30s
6   35 to 39  1238      30s
7   40 to 44  1384      40 
8   45 to 49  1612      40 
9   50 to 54  1606      40 
10  55 to 59  1897      40

CodePudding user response：

What you can do is take the first two characters to get the first age number of your current group and then just use cut and re-define your breaks and labels.

code

df %>% 
  mutate(age_band = cut(
    as.numeric(substr(age, 1, 2)), 
    breaks = c(15, 20, 30, 40, 100), 
    labels = c("15 to 19", "20s", "30s", "40 "), 
    right = F)
  )

output

         age value  age_band
1   15 to 17  1216  15 to 19
2  18 and 19  1983  15 to 19
3   20 to 24  1839       20s
4   25 to 29  1558       20s
5   30 to 34  1741       30s
6   35 to 39  1171       30s
7   40 to 44  1324       40 
8   45 to 49  1354       40 
9   50 to 54  1342       40 
10  55 to 59  1467       40