Home > Mobile >  Holes in ordered groups
Holes in ordered groups

Time:11-17

I have a data frame ordered by id and year, observed n times over a number of years. Number of observations per individual per year is irregular. I define a "hole" in the data as an observation where x2=1 and the observation immediatly above, for the same id (not necessarily for the sameyear), is equal to 0. For example, individual A has a hole in 2002. When this happens, I need to create a variable where I store the value of x1 immediatly above, for which x2=0. In the example of individual A, I would then need the new variable to equal 5 when x2=1.

x1 = c(5,3,2,2,5,7,7,3,4,8)
x2 = c(0,1,0,1,0,1,0,1,0,1)
id = c("A","A","A","B","B","C","C","C","D","D")
year = c(2001,2002,2003,2001,2002,2001,2001,2002,2001,2002)

df = data.frame(year,id,x1,x2)

Considering this sample data frame, I would need the new variable to look like this:

outcome = c(.,5,.,.,.,.,.,7,.,4)

The dataset I'm working with has close to 10.000.000 observations, for 3.000.000 individuals over 4 years, so I can't do this manually. Is there any generalized way to achieve this that works with any dataset, regardless of dimension?

I went through a few posts here using for loops to iterate over groups (one example was this one Iterating a for loop over groups in a dataset) but I wasn't able to apply any of it. I've been trying to do it in R after being unsuccessful in stata 14. I wasn't able to find any post that applied to ordered groups, which is what I'm looking for.

CodePudding user response:

Here's a simple way to get your outcome with dplyr.

library(dplyr)

df %>% 
  group_by(id) %>% 
  mutate(
    outcome = ifelse(x2 == 1 & lag(x2) == 0, lag(x1), NA)
  )

Result

# A tibble: 10 × 5
# Groups:   id [4]
    year id       x1    x2 outcome
   <dbl> <chr> <dbl> <dbl>   <dbl>
 1  2001 A         5     0      NA
 2  2002 A         3     1       5
 3  2003 A         2     0      NA
 4  2001 B         2     1      NA
 5  2002 B         5     0      NA
 6  2001 C         7     1      NA
 7  2001 C         7     0      NA
 8  2002 C         3     1       7
 9  2001 D         4     0      NA
10  2002 D         8     1       4
  • Related