I have a dataframe that includes various ages and a value for each age. I would like to classify the different ages into broader age groups. To do so, I had to create a fairly complicated nested ifelse statement:
library(tidyverse)
df <- data.frame(age = c("15 to 17", "18 and 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59"),
value = sample(1000:2000,10, replace=TRUE))
new_df = df %>%
mutate(age_band =
ifelse(age %in% c("15 to 17","18 and 19"), '15 to 19', ifelse(age %in% c("20 to 24","25 to 29"), '20s', ifelse(age %in% c("30 to 34","35 to 39"), '30s','40 '))))
Is there any way to do this without using such complex nested statements? The data goes all the way to ages 85 and over, and it got very complicated to classify each different age group.
CodePudding user response:
Following @Claudiu Papasteri recommendations:
Package
library(dplyr)
Solution
df %>%
mutate(age_band = case_when( age %in% c("15 to 17","18 and 19") ~ '15 to 19',
age %in% c("20 to 24","25 to 29") ~ '20s',
age %in% c("30 to 34","35 to 39") ~ '30s',
TRUE ~'40 ')
)
ps: I would include all specific statements in case_when
, even the last one for "40 ", this way you can keep track of any problem in your dataset. Because if something is missing, or you have a typo, everything else will be coded as "40 ". So by stating every case you can catch the problem and fix everything before running an analysis or generating graphs. And the last statement could be changed to TRUE ~ age
. This means, whatever remains, use the same value, or TRUE ~ NA
, that you recode the remaining stuff for missing. So you know that all missing means you have a problem you have to fix.
Output
age value age_band
1 15 to 17 1432 15 to 19
2 18 and 19 1112 15 to 19
3 20 to 24 1265 20s
4 25 to 29 1076 20s
5 30 to 34 1212 30s
6 35 to 39 1238 30s
7 40 to 44 1384 40
8 45 to 49 1612 40
9 50 to 54 1606 40
10 55 to 59 1897 40
CodePudding user response:
What you can do is take the first two characters to get the first age number of your current group and then just use cut and re-define your breaks and labels.
code
df %>%
mutate(age_band = cut(
as.numeric(substr(age, 1, 2)),
breaks = c(15, 20, 30, 40, 100),
labels = c("15 to 19", "20s", "30s", "40 "),
right = F)
)
output
age value age_band
1 15 to 17 1216 15 to 19
2 18 and 19 1983 15 to 19
3 20 to 24 1839 20s
4 25 to 29 1558 20s
5 30 to 34 1741 30s
6 35 to 39 1171 30s
7 40 to 44 1324 40
8 45 to 49 1354 40
9 50 to 54 1342 40
10 55 to 59 1467 40