I am working with a panel dataset where individual identifiers are not available. What I have, instead, is an identifier for households fixed across waves (variable nquest), an identifier for individuals within each household changing across waves (variable nord) and an identifier for individuals referring to the value of the variable nord in the previous wave (variable nordp). So, a simplified dataframe would look like this:
row | Year | nquest | nord | nordp |
---|---|---|---|---|
1 | 2010 | 1 | 1 | NA |
2 | 2012 | 1 | 1 | 1 |
3 | 2012 | 1 | 2 | NA |
4 | 2012 | 1 | 3 | NA |
5 | 2014 | 1 | 1 | 1 |
6 | 2014 | 1 | 2 | 3 |
7 | 2010 | 2 | 1 | NA |
8 | 2010 | 2 | 2 | NA |
9 | 2012 | 2 | 1 | 1 |
10 | 2012 | 2 | 2 | 2 |
11 | 2012 | 2 | 3 | NA |
In this example, rows 1, 2, and 5 refer to the same individual; as well as 4 and 6, 7 and 9, and 8 and 10 (again, nordp refers to nord in the previous wave, whereas nquest is fixed).
Is there a quick way to create a variable identifying the same individual across waves? I've tried for loops that build complicated labels and extract characters across waves, but it takes ages (the dataset is large, but still), and I'm sure there is an easier way, with dplyr or something similar. Unlike other solutions I've found on the website, I need to do this for every single observation, without specifying particular values of nord, nordp, or nquest.
Thank you!
EDIT: Ideally, I would come up with a table like this:
row | Year | nquest | nord | nordp | id |
---|---|---|---|---|---|
1 | 2010 | 1 | 1 | NA | 1 |
2 | 2012 | 1 | 1 | 1 | 1 |
3 | 2012 | 1 | 2 | NA | 2 |
4 | 2012 | 1 | 3 | NA | 3 |
5 | 2014 | 1 | 1 | 1 | 1 |
6 | 2014 | 1 | 2 | 3 | 3 |
7 | 2010 | 2 | 1 | NA | 4 |
8 | 2010 | 2 | 2 | NA | 5 |
9 | 2012 | 2 | 1 | 1 | 4 |
10 | 2012 | 2 | 2 | 2 | 5 |
11 | 2012 | 2 | 3 | NA | 6 |
CodePudding user response:
You could try the following dplyr approach, which uses mutate to make a new id column, and case_when to implement the logic.
As a best practice (coming from someone who makes a lot of identifiers in their data), instead of making the IDs sequential it may be better to concatenate the family ID (nquest) with the individual family member ID. This allows the simultaneous identification of both the family and the individual. For convenience, it is also better to make IDs alphanumeric, so the code below outputs an ID of the form "Kx-x", where the first integer is the family ID and the second is the individual ID. The letter K is arbitrary.
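library(dplyr)

new_df <- df %>%
  group_by(nquest) %>%
  mutate(
    # Numeric individual ID within the household
    id = case_when(
      nord == 1 | nordp == 1 ~ 1,        # position 1 now or in the previous wave
      is.na(nordp) ~ as.numeric(nord),   # new entrant: use the current position
      TRUE ~ as.numeric(nordp)           # otherwise: use the previous-wave position
    ),
    # Prefix with the household ID, e.g. "K1-3"
    id = paste0("K", nquest, "-", id)
  ) %>%
  ungroup()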
Output:
# row Year nquest nord nordp id
# <int> <int> <int> <int> <int> <chr>
# 1 1 2010 1 1 NA K1-1
# 2 2 2012 1 1 1 K1-1
# 3 3 2012 1 2 NA K1-2
# 4 4 2012 1 3 NA K1-3
# 5 5 2014 1 1 1 K1-1
# 6 6 2014 1 2 3 K1-3
# 7 7 2010 2 1 NA K2-1
# 8 8 2010 2 2 NA K2-2
# 9 9 2012 2 1 1 K2-1
# 10 10 2012 2 2 2 K2-2
# 11 11 2012 2 3 NA K2-3
Data:
df <- read.table(text = "row Year nquest nord nordp
1 2010 1 1 NA
2 2012 1 1 1
3 2012 1 2 NA
4 2012 1 3 NA
5 2014 1 1 1
6 2014 1 2 3
7 2010 2 1 NA
8 2010 2 2 NA
9 2012 2 1 1
10 2012 2 2 2
11 2012 2 3 NA", header = TRUE)
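Note that the case_when rules above reproduce the example, but they look back only one wave, so they would not follow a person whose nord changes in one wave and who then stays in the new position (the second row would fall back to its own nordp). A minimal base R sketch of a more general approach is given below. It assumes that nordp always refers to nord in the immediately preceding wave present in the data, and that a missing nordp means a new entrant; link_individuals is a hypothetical helper name, not part of any package. The IDs it produces group rows the same way as the desired table, but they are numbered in order of first appearance by wave, so the labels themselves differ.

```r
# Same example data as above.
df <- read.table(text = "row Year nquest nord nordp
1 2010 1 1 NA
2 2012 1 1 1
3 2012 1 2 NA
4 2012 1 3 NA
5 2014 1 1 1
6 2014 1 2 3
7 2010 2 1 NA
8 2010 2 2 NA
9 2012 2 1 1
10 2012 2 2 2
11 2012 2 3 NA", header = TRUE)

# Hypothetical helper: walk the waves in chronological order and let each row
# inherit the id of the row it points to in the previous wave.
link_individuals <- function(data) {
  data$id <- NA_integer_
  next_id <- 1L
  prev <- NULL
  for (yr in sort(unique(data$Year))) {
    idx <- which(data$Year == yr)
    if (!is.null(prev)) {
      # Match (nquest, nordp) of this wave to (nquest, nord) of the previous one.
      key_cur  <- paste(data$nquest[idx], data$nordp[idx])
      key_prev <- paste(prev$nquest, prev$nord)
      data$id[idx] <- prev$id[match(key_cur, key_prev)]
    }
    # Rows without a match (new entrants) get fresh sequential ids.
    new <- idx[is.na(data$id[idx])]
    data$id[new] <- next_id + seq_along(new) - 1L
    next_id <- next_id + length(new)
    prev <- data[idx, ]
  }
  data
}

out <- link_individuals(df)
```

Because the loop runs once per wave rather than once per row, and the matching inside each wave is vectorised, this should stay reasonably fast on a large panel with few waves. Under the stated assumption, a person who is absent for one wave and returns later would be assigned a new id.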
CodePudding user response:
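Here is an option that builds a key for each row from nquest, nord, and nordp, and then converts the keys to integers: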
library(purrr)
library(dplyr)
library(stringr)
df1 %>%
  mutate(id = as.integer(factor(pmap_chr(
    across(nquest:nordp),
    # Per row: drop NAs and duplicates, then paste the first and last value
    ~ {
      v1 <- unique(na.omit(c(...)))
      str_c(first(v1), last(v1))
    }
  ))))
Output:
row Year nquest nord nordp id
1 1 2010 1 1 NA 1
2 2 2012 1 1 1 1
3 3 2012 1 2 NA 2
4 4 2012 1 3 NA 3
5 5 2014 1 1 1 1
6 6 2014 1 2 3 3
7 7 2010 2 1 NA 4
8 8 2010 2 2 NA 5
9 9 2012 2 1 1 4
10 10 2012 2 2 2 5
11 11 2012 2 3 NA 6
Data:
df1 <- structure(list(row = 1:11, Year = c(2010L, 2012L, 2012L, 2012L,
2014L, 2014L, 2010L, 2010L, 2012L, 2012L, 2012L), nquest = c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), nord = c(1L, 1L, 2L,
3L, 1L, 2L, 1L, 2L, 1L, 2L, 3L), nordp = c(NA, 1L, NA, NA, 1L,
3L, NA, NA, 1L, 2L, NA)), class = "data.frame", row.names = c(NA,
-11L))