I have a data frame that looks like this:
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1))
)
where id is an identifier for individuals in the data set, and generation, income and fem are categorical characteristics of the individuals. Now, I want to put the individuals into cohorts ("groups") based on these characteristics, where individuals with exactly the same values for all characteristics should get the same cohort_id. Hence, I want the following result:
data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1)),
  cohort_id = as.factor(c(1, 2, 3, 4, 3))
)
Note that id = 3 and id = 5 get the same cohort_id, as they have the same characteristics.
Is there a fast way to create the cohort_id values without chaining multiple case_when or ifelse calls? That approach gets quite tedious if you want to build many cohorts. A solution using dplyr would be nice but is not necessary.
CodePudding user response:
There are multiple ways to do this. One option is to paste the columns together into a single key and match that key against its unique values:
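library(dplyr)
library(stringr)
df %>%
  # build one key per row; the separator keeps e.g. ("1", "23") and ("12", "3") distinct
  mutate(cohort_id = str_c(generation, income, fem, sep = "_"),
         # number the keys in order of first appearance
         cohort_id = match(cohort_id, unique(cohort_id)))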
Output:
  id generation income fem cohort_id
1  1          3      4   0         1
2  2          2      3   0         2
3  3          4      3   1         3
4  4          3      7   0         4
5  5          4      3   1         3
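If you would rather not load any packages, the same paste-and-match idea can be sketched in base R. This is only a minimal sketch: the helper names cols and key are made up for illustration, and the "_" separator assumes that none of the values contain an underscore themselves.

```r
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1))
)

# columns that define a cohort (hypothetical helper variable)
cols <- c("generation", "income", "fem")

# paste the chosen columns row-wise into a single key per individual
key <- do.call(paste, c(df[cols], sep = "_"))

# number the keys in order of first appearance
df$cohort_id <- match(key, unique(key))

print(df)
```

Adding another characteristic only means adding its name to cols. If cohort_id should be a factor, as in the desired result, wrap the last assignment in factor().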
CodePudding user response:
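The following code creates an index cohort_id whose values differ from the expected output but still satisfy the grouping rule. The numbering differs because cur_group_id() numbers the groups in sorted order of the grouping columns rather than in order of first appearance: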
library(dplyr)
df %>%
  group_by(generation, income, fem) %>%
  mutate(cohort_id = cur_group_id()) %>%
  ungroup()
# A tibble: 5 × 5
     id generation income fem   cohort_id
  <dbl> <fct>      <fct>  <fct>     <int>
1     1 3          4      0             2
2     2 2          3      0             1
3     3 4          3      1             4
4     4 3          7      0             3
5     5 4          3      1             4
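If the exact numbering from the question matters, the group ids can be relabelled in order of first appearance afterwards. The following is a rough sketch, assuming dplyr 1.0.0 or later (for cur_group_id()); the object name out is just illustrative.

```r
library(dplyr)

df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1))
)

out <- df %>%
  group_by(generation, income, fem) %>%
  # group ids follow the sorted order of the grouping columns
  mutate(cohort_id = cur_group_id()) %>%
  ungroup() %>%
  # relabel so that ids follow the order in which cohorts first appear
  mutate(cohort_id = match(cohort_id, unique(cohort_id)))

print(out)
```

Both numberings describe the same cohorts, so the relabelling step is only needed when the ids have to match a given ordering.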