Creating factor from multiple other factors fast

Time:12-17

I have a data frame that looks like this:

df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1))
)

where id is an identifier for individuals in the data set and generation, income and fem are categorical characteristics of the individuals. Now, I want to put the individuals into cohorts ("groups") based on the individual characteristics, where individuals with the exact same values for the individual characteristics should get the same cohort_id. Hence, I want the following result:

data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1)),
  cohort_id = as.factor(c(1, 2, 3, 4, 3))
)

Note that id = 3 and id = 5 get the same cohort_id because they have the same characteristics.

My question: is there a fast way to create the cohort_id values without chaining multiple case_when or ifelse calls? That gets quite tedious when there are many cohorts. A solution using dplyr would be nice but is not necessary.

CodePudding user response:

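There are several ways to do this. One option is to paste the columns into a single key and match each key against the unique keys, which numbers the cohorts in order of first appearance. Passing sep keeps the keys unambiguous when values have more than one digit (without it, "1" and "12" would give the same key as "11" and "2"):

library(dplyr)
library(stringr)

df %>%
  mutate(cohort_id = str_c(generation, income, fem, sep = "_"),
         cohort_id = match(cohort_id, unique(cohort_id)))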

Output:

  id generation income fem cohort_id
1  1          3      4   0         1
2  2          2      3   0         2
3  3          4      3   1         3
4  4          3      7   0         4
5  5          4      3   1         3
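Since dplyr is not required, the same paste-and-match idea can be sketched in base R. This is only a minimal sketch: the vector name cols and the "_" separator are assumptions made for illustration, and the separator must not occur inside any of the values.

```r
# Example data from the question
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  generation = as.factor(c(3, 2, 4, 3, 4)),
  income = as.factor(c(4, 3, 3, 7, 3)),
  fem = as.factor(c(0, 0, 1, 0, 1))
)

# Columns that define a cohort (hypothetical helper vector; extend as needed)
cols <- c("generation", "income", "fem")

# Build one key per row by pasting the chosen columns together
key <- do.call(paste, c(df[cols], sep = "_"))

# Number the keys in order of first appearance and store them as a factor
df$cohort_id <- factor(match(key, unique(key)))
```

Adding another characteristic then only means adding its column name to cols.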

CodePudding user response:

The following code creates a cohort_id whose numbering differs from the expected output but still satisfies the grouping rule: cur_group_id() numbers the groups in the sorted order of the grouping columns rather than in order of first appearance.

library(dplyr)

df %>%
  group_by(generation, income, fem) %>%
  mutate(cohort_id = cur_group_id()) %>%
  ungroup()

# A tibble: 5 × 5
     id generation income fem   cohort_id
  <dbl> <fct>      <fct>  <fct>     <int>
1     1 3          4      0             2
2     2 2          3      0             1
3     3 4          3      1             4
4     4 3          7      0             3
5     5 4          3      1             4
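If the numbering has to follow the order in which cohorts first appear, as in the expected result from the question, ids like the ones above can be renumbered afterwards. A minimal base R sketch, where sorted_id is a hypothetical name and its values are copied by hand from the output above:

```r
# cohort_id values copied from the output above (numbered by sorted group keys)
sorted_id <- c(2L, 1L, 4L, 3L, 4L)

# Renumber so that the first id seen becomes 1, the next new one 2, and so on
first_seen_id <- match(sorted_id, unique(sorted_id))

# Store as a factor, matching the column type used in the question
cohort_id <- factor(first_seen_id)
```

The grouping itself does not change; only the labels are reassigned.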