I have a wide-format longitudinal dataset with a set of variables that indicate what state a participant lived in for each year of the study period. If no participant lived in a given state that year, there is no level for that state in the variable. For example, using a simplified version in which the dataset contains participants from New England states (MA, CT, RI, VT, NH, ME) only:
ID <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
state_2000 <- c("MA", "MA", "RI", "VT", "NH", "NH", "ME", "CT", "CT", "ME")
state_2002 <- c("MA", "MA", "RI", "VT", "NH", "NH", "ME", "CT", "CT", "ME")
# participant # 3 moves from RI to MA; RI no longer a level in the subsequent state variables
state_2004 <- c("MA", "MA", "MA", "VT", "NH", "NH", "ME", "CT", "CT", "ME")
state_2006 <- c("MA", "MA", "MA", "VT", "NH", "NH", "ME", "CT", "CT", "ME")
df <- data.frame(ID, state_2000, state_2002, state_2004, state_2006)
print (df)
ID state_2000 state_2002 state_2004 state_2006
1 1 MA MA MA MA
2 2 MA MA MA MA
3 3 RI RI MA MA
4 4 VT VT VT VT
5 5 NH NH NH NH
6 6 NH NH NH NH
7 7 ME ME ME ME
8 8 CT CT CT CT
9 9 CT CT CT CT
10 10 ME ME ME ME
table(df$state_2002, useNA = "always")
CT MA ME NH RI VT <NA>
2 2 2 2 1 1 0
table(df$state_2004, useNA = "always")
CT MA ME NH VT <NA>
2 3 2 2 1 0
I want to create a set of new state variables that have categories for each state (where categories for states where no one lives in that year would be missing), perhaps by using some combination of mutate(), across(), starts_with(), and case_when(). I've tried something like:
df <-
df %>%
mutate(across(starts_with("state_"), as.factor(case_when(
starts_with("state_")=="CT" ~ 1,
starts_with("state_")=="MA" ~ 2,
starts_with("state_")=="ME" ~ 3,
starts_with("state_")== "NH" ~ 4,
starts_with("state_")=="RI" ~ 5,
starts_with("state_")=="VT" ~ 6,
TRUE ~ NA_real_))))
However, this doesn't seem to work, as I get errors like:
Error in `mutate()`:
! Problem while computing `..1 = across(...)`.
Caused by error:
! attempt to select less than one element in integerOneIndex
Does anyone know how to do this? Thank you so much!
CodePudding user response:
You don't need case_when
here.
Create a vector of unique
lev
els in your dataset, and then you can create multiple columns with across
. Convert your columns to factor
with the same lev
els and then use as.numeric
to convert them to numbers: they will all share the same numbering.
lev = unique(unlist(df[-1]))
df %>%
mutate(across(starts_with("state_"), ~ as.numeric(factor(.x, levels = lev)),
.names = "{col}_new"))
ID state_2000 state_2002 state_2004 state_2006 state_2000_new state_2002_new state_2004_new state_2006_new
1 1 MA MA MA MA 1 1 1 1
2 2 MA MA MA MA 1 1 1 1
3 3 RI RI MA MA 2 2 1 1
4 4 VT VT VT VT 3 3 3 3
5 5 NH NH NH NH 4 4 4 4
6 6 NH NH NH NH 4 4 4 4
7 7 ME ME ME ME 5 5 5 5
8 8 CT CT CT CT 6 6 6 6
9 9 CT CT CT CT 6 6 6 6
10 10 ME ME ME ME 5 5 5 5
CodePudding user response:
Use ~
and .
to create a function
df <- df %>%
mutate(across(starts_with("state_"), ~ as.factor(case_when(
. =="CT" ~ 1,
. =="MA" ~ 2,
. =="ME" ~ 3,
. == "NH" ~ 4,
. =="RI" ~ 5,
. =="VT" ~ 6,
TRUE ~ NA_real_)),
.names = "{.col}_num")) # Remove this argument if you don't want to retain the initial columns
df
ID state_2000 state_2002 state_2004 state_2006 state_2000_num state_2002_num state_2004_num
1 1 MA MA MA MA 2 2 2
2 2 MA MA MA MA 2 2 2
3 3 RI RI MA MA 5 5 2
4 4 VT VT VT VT 6 6 6
5 5 NH NH NH NH 4 4 4
6 6 NH NH NH NH 4 4 4
7 7 ME ME ME ME 3 3 3
8 8 CT CT CT CT 1 1 1
9 9 CT CT CT CT 1 1 1
10 10 ME ME ME ME 3 3 3
state_2006_num
1 2
2 2
3 2
4 6
5 4
6 4
7 3
8 1
9 1
10 3