Easy and quick loop to extract values from another row in R, based on condition?

I am working with a panel dataset where individual identifiers are not available. What I have instead is a household identifier that is fixed across waves (variable nquest), an individual identifier within each household that can change across waves (variable nord), and a variable giving the individual's value of nord in the previous wave (variable nordp). A simplified data frame would look like this:

row	Year	nquest	nord	nordp
1	2010	1	1	NA
2	2012	1	1	1
3	2012	1	2	NA
4	2012	1	3	NA
5	2014	1	1	1
6	2014	1	2	3
7	2010	2	1	NA
8	2010	2	2	NA
9	2012	2	1	1
10	2012	2	2	2
11	2012	2	3	NA

In this example, rows 1, 2, and 5 refer to the same individual, as do rows 4 and 6, 7 and 9, and 8 and 10 (again, nordp refers to nord in the previous wave, whereas nquest is fixed).

Is there a quick way to create a variable identifying the same individual across waves? I've tried for loops that build complicated labels and extract characters across waves, but they take ages (the dataset is large, but still), and I'm sure there is an easier way, with dplyr or something similar. Unlike other solutions I've found on this site, I need to do this for every single observation, without specifying particular values of nord, nordp or nquest.

Thank you!

EDIT: Ideally, I would come up with a table like this:

row	Year	nquest	nord	nordp	id
1	2010	1	1	NA	1
2	2012	1	1	1	1
3	2012	1	2	NA	2
4	2012	1	3	NA	3
5	2014	1	1	1	1
6	2014	1	2	3	3
7	2010	2	1	NA	4
8	2010	2	2	NA	5
9	2012	2	1	1	4
10	2012	2	2	2	5
11	2012	2	3	NA	6

CodePudding user response：

You could try the following dplyr approach, which uses mutate to make a new id column, and case_when to implement the logic.

As a best practice (coming from someone who makes a lot of identifiers in their data), instead of making the IDs sequential it may be better to concatenate the family ID (nquest) with the individual family member ID. This allows both the family and the individual to be identified from the same value. For convenience, it is also better to make IDs alphanumeric, so the code below outputs an ID of the form "Kx-x", where the first integer is the family ID and the second is the individual ID. The letter K is arbitrary.

library(dplyr)

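new_df <- df %>%
  group_by(nquest) %>%
  mutate(
    # Person number within the household: 1 if either nord or nordp is 1,
    # nord for people with no previous-wave link, otherwise nordp.
    id = case_when(
      nord == 1 | nordp == 1 ~ 1,
      is.na(nordp) ~ as.numeric(nord),
      TRUE ~ as.numeric(nordp)
    ),
    # Prefix with the household ID, e.g. "K1-3".
    id = paste0("K", nquest, "-", id)
  )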

Output:

#      row  Year nquest  nord nordp id   
#    <int> <int>  <int> <int> <int> <chr>
#  1     1  2010      1     1    NA K1-1 
#  2     2  2012      1     1     1 K1-1 
#  3     3  2012      1     2    NA K1-2 
#  4     4  2012      1     3    NA K1-3 
#  5     5  2014      1     1     1 K1-1 
#  6     6  2014      1     2     3 K1-3 
#  7     7  2010      2     1    NA K2-1 
#  8     8  2010      2     2    NA K2-2 
#  9     9  2012      2     1     1 K2-1 
# 10    10  2012      2     2     2 K2-2 
# 11    11  2012      2     3    NA K2-3

Data:

df <- read.table(text = "row    Year    nquest  nord    nordp
1   2010    1   1   NA
2   2012    1   1   1
3   2012    1   2   NA
4   2012    1   3   NA
5   2014    1   1   1
6   2014    1   2   3
7   2010    2   1   NA
8   2010    2   2   NA
9   2012    2   1   1
10  2012    2   2   2
11  2012    2   3   NA", header = TRUE)
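
Note that the case_when rule above only looks one wave back, which is enough for the sample data. If a person's nord can change more than once (say 2, then 3, then 1), the ID has to be carried forward wave by wave. Below is a minimal base R sketch of that idea, not a tested solution for the full dataset. The function name link_individuals is hypothetical, and it assumes that "previous wave" means the immediately preceding Year present in the data and that (nquest, nord) is unique within a wave. The IDs it assigns are arbitrary integers, so the numbering can differ from the table in the question while the grouping of rows stays the same.

```r
# Sample data from the question.
df <- data.frame(
  row    = 1:11,
  Year   = c(2010, 2012, 2012, 2012, 2014, 2014, 2010, 2010, 2012, 2012, 2012),
  nquest = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  nord   = c(1, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  nordp  = c(NA, 1, NA, NA, 1, 3, NA, NA, 1, 2, NA)
)

# Hypothetical helper: carry an integer id forward from wave to wave.
# Assumes the previous wave is the preceding Year value in the data.
link_individuals <- function(d) {
  d$id <- NA_integer_
  next_id <- 0L
  prev <- NULL  # row indices of the previous wave (ids already assigned)
  for (w in sort(unique(d$Year))) {
    cur <- which(d$Year == w)
    if (!is.null(prev)) {
      # Match (nquest, nordp) of this wave to (nquest, nord) of the last one.
      key_prev <- paste(d$nquest[prev], d$nord[prev])
      key_cur  <- paste(d$nquest[cur], d$nordp[cur])
      m <- match(key_cur, key_prev)
      m[is.na(d$nordp[cur])] <- NA  # no link recorded: treat as a new person
      d$id[cur] <- d$id[prev][m]
    }
    # Rows still without an id are new individuals.
    new_rows <- cur[is.na(d$id[cur])]
    d$id[new_rows] <- next_id + seq_along(new_rows)
    next_id <- next_id + length(new_rows)
    prev <- cur
  }
  d
}

linked <- link_individuals(df)
```

Because the loop runs once per wave rather than once per row, it should stay reasonably fast on a large panel, although that has not been measured here.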

CodePudding user response：

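Here is an option using purrr::pmap_chr. For each row, it pastes together the first and last unique non-NA values of nquest, nord and nordp, then converts the resulting keys to integer IDs with factor().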

library(purrr)
library(dplyr)
library(stringr)
df1 %>%
  mutate(id = as.integer(factor(pmap_chr(
    across(nquest:nordp),
    # Key: first and last unique non-NA values of (nquest, nord, nordp).
    ~ {
      v1 <- unique(na.omit(c(...)))
      str_c(first(v1), last(v1))
    }
  ))))

-output

   row Year nquest nord nordp id
1    1 2010      1    1    NA  1
2    2 2012      1    1     1  1
3    3 2012      1    2    NA  2
4    4 2012      1    3    NA  3
5    5 2014      1    1     1  1
6    6 2014      1    2     3  3
7    7 2010      2    1    NA  4
8    8 2010      2    2    NA  5
9    9 2012      2    1     1  4
10  10 2012      2    2     2  5
11  11 2012      2    3    NA  6

data

df1 <- structure(list(row = 1:11, Year = c(2010L, 2012L, 2012L, 2012L, 
2014L, 2014L, 2010L, 2010L, 2012L, 2012L, 2012L), nquest = c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), nord = c(1L, 1L, 2L, 
3L, 1L, 2L, 1L, 2L, 1L, 2L, 3L), nordp = c(NA, 1L, NA, NA, 1L, 
3L, NA, NA, 1L, 2L, NA)), class = "data.frame", row.names = c(NA, 
-11L))
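
For comparison, a similar one-step rule can be written without a row-wise pmap call. The following is a sketch in base R, assuming that nordp, when present, can be used directly as the person's number within the household. Like the code above, it does not follow people whose nord changes more than once. The names panel, person and key are made up for illustration.

```r
# Illustrative copy of the sample data (only the columns that are needed).
panel <- data.frame(
  nquest = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  nord   = c(1, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  nordp  = c(NA, 1, NA, NA, 1, 3, NA, NA, 1, 2, NA)
)

# Person number within the household: nordp if available, otherwise nord.
person <- ifelse(is.na(panel$nordp), panel$nord, panel$nordp)

# Household and person joined with a separator, so that e.g. household 1,
# person 12 cannot collide with household 11, person 2.
key <- paste(panel$nquest, person, sep = "-")

# Number the keys in order of first appearance.
panel$id <- as.integer(factor(key, levels = unique(key)))
```

On the sample data this gives the id column 1, 1, 2, 3, 1, 3, 4, 5, 4, 5, 6, the same as the table requested in the question.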