Home > OS >  Is it possible to simultaneously create many new variables using mutate(), across(), starts_with(),
Is it possible to simultaneously create many new variables using mutate(), across(), starts_with(),

Time:09-07

I have a wide-format longitudinal dataset with a set of variables that indicate what state a participant lived in for each year of the study period. If no participant lived in a given state that year, there is no level for that state in the variable. For example, using a simplified version in which the dataset contains participants from New England states (MA, CT, RI, VT, NH, ME) only:

ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
state_2000 <- c("MA", "MA", "RI", "VT", "NH", "NH", "ME", "CT", "CT", "ME")
state_2002 <- c("MA", "MA", "RI", "VT", "NH", "NH", "ME", "CT", "CT", "ME")
# participant # 3 moves from RI to MA; RI no longer a level in the subsequent state variables
state_2004 <- c("MA", "MA", "MA", "VT", "NH", "NH", "ME", "CT", "CT", "ME")
state_2006 <- c("MA", "MA", "MA", "VT", "NH", "NH", "ME", "CT", "CT", "ME")

df <- data.frame(ID, state_2000, state_2002, state_2004, state_2006)

print (df)
   ID state_2000 state_2002 state_2004 state_2006
1   1         MA         MA         MA         MA
2   2         MA         MA         MA         MA
3   3         RI         RI         MA         MA
4   4         VT         VT         VT         VT
5   5         NH         NH         NH         NH
6   6         NH         NH         NH         NH
7   7         ME         ME         ME         ME
8   8         CT         CT         CT         CT
9   9         CT         CT         CT         CT
10 10         ME         ME         ME         ME

table(df$state_2002, useNA = "always")
  CT   MA   ME   NH   RI   VT <NA> 
   2    2    2    2    1    1    0 

table(df$state_2004, useNA = "always")
  CT   MA   ME   NH   VT <NA> 
   2    3    2    2    1    0 

I want to create a set of new state variables that have categories for each state (where categories for states where no one lives in that year would be missing), perhaps by using some combination of mutate(), across(), starts_with(), and case_when(). I've tried something like:

df <- 
  df %>%
  mutate(across(starts_with("state_"), as.factor(case_when( 
    starts_with("state_")=="CT" ~ 1,
    starts_with("state_")=="MA" ~ 2,
    starts_with("state_")=="ME" ~ 3,
    starts_with("state_")== "NH" ~ 4,
    starts_with("state_")=="RI" ~ 5,
    starts_with("state_")=="VT" ~ 6,
    TRUE ~ NA_real_)))) 

However, this doesn't seem to work, as I get errors like:

Error in `mutate()`:
! Problem while computing `..1 = across(...)`.
Caused by error:
! attempt to select less than one element in integerOneIndex

Does anyone know how to do this? Thank you so much!

CodePudding user response:

You don't need case_when here.

Create a vector of unique levels in your dataset, and then you can create multiple columns with across. Convert your columns to factor with the same levels and then use as.numeric to convert them to numbers: they will all share the same numbering.

lev = unique(unlist(df[-1]))
df %>% 
  mutate(across(starts_with("state_"), ~ as.numeric(factor(.x, levels = lev)),
         .names = "{col}_new"))
   ID state_2000 state_2002 state_2004 state_2006 state_2000_new state_2002_new state_2004_new state_2006_new
1   1         MA         MA         MA         MA              1              1              1              1
2   2         MA         MA         MA         MA              1              1              1              1
3   3         RI         RI         MA         MA              2              2              1              1
4   4         VT         VT         VT         VT              3              3              3              3
5   5         NH         NH         NH         NH              4              4              4              4
6   6         NH         NH         NH         NH              4              4              4              4
7   7         ME         ME         ME         ME              5              5              5              5
8   8         CT         CT         CT         CT              6              6              6              6
9   9         CT         CT         CT         CT              6              6              6              6
10 10         ME         ME         ME         ME              5              5              5              5

CodePudding user response:

Use ~ and . to create a function

df <- df %>%
  mutate(across(starts_with("state_"), ~ as.factor(case_when( 
    . =="CT" ~ 1,
    . =="MA" ~ 2,
    . =="ME" ~ 3,
    . == "NH" ~ 4,
    . =="RI" ~ 5,
    . =="VT" ~ 6,
    TRUE ~ NA_real_)), 
    .names = "{.col}_num")) # Remove this argument if you don't want to retain the initial columns

df
   ID state_2000 state_2002 state_2004 state_2006 state_2000_num state_2002_num state_2004_num
1   1         MA         MA         MA         MA              2              2              2
2   2         MA         MA         MA         MA              2              2              2
3   3         RI         RI         MA         MA              5              5              2
4   4         VT         VT         VT         VT              6              6              6
5   5         NH         NH         NH         NH              4              4              4
6   6         NH         NH         NH         NH              4              4              4
7   7         ME         ME         ME         ME              3              3              3
8   8         CT         CT         CT         CT              1              1              1
9   9         CT         CT         CT         CT              1              1              1
10 10         ME         ME         ME         ME              3              3              3
   state_2006_num
1               2
2               2
3               2
4               6
5               4
6               4
7               3
8               1
9               1
10              3
  • Related