I have a data frame ordered by id
and year
, observed n times over a number of years. Number of observations per individual per year is irregular. I define a "hole" in the data as an observation where x2=1
and the observation immediatly above, for the same id
(not necessarily for the sameyear
), is equal to 0. For example, individual A has a hole in 2002. When this happens, I need to create a variable where I store the value of x1
immediatly above, for which x2=0
. In the example of individual A, I would then need the new variable to equal 5 when x2=1
.
x1 = c(5,3,2,2,5,7,7,3,4,8)
x2 = c(0,1,0,1,0,1,0,1,0,1)
id = c("A","A","A","B","B","C","C","C","D","D")
year = c(2001,2002,2003,2001,2002,2001,2001,2002,2001,2002)
df = data.frame(year,id,x1,x2)
Considering this sample data frame, I would need the new variable to look like this:
outcome = c(.,5,.,.,.,.,.,7,.,4)
The dataset I'm working with has close to 10.000.000 observations, for 3.000.000 individuals over 4 years, so I can't do this manually. Is there any generalized way to achieve this that works with any dataset, regardless of dimension?
I went through a few posts here using for loops to iterate over groups (one example was this one Iterating a for loop over groups in a dataset) but I wasn't able to apply any of it. I've been trying to do it in R after being unsuccessful in stata 14. I wasn't able to find any post that applied to ordered groups, which is what I'm looking for.
CodePudding user response:
Here's a simple way to get your outcome
with dplyr
.
library(dplyr)
df %>%
group_by(id) %>%
mutate(
outcome = ifelse(x2 == 1 & lag(x2) == 0, lag(x1), NA)
)
Result
# A tibble: 10 × 5
# Groups: id [4]
year id x1 x2 outcome
<dbl> <chr> <dbl> <dbl> <dbl>
1 2001 A 5 0 NA
2 2002 A 3 1 5
3 2003 A 2 0 NA
4 2001 B 2 1 NA
5 2002 B 5 0 NA
6 2001 C 7 1 NA
7 2001 C 7 0 NA
8 2002 C 3 1 7
9 2001 D 4 0 NA
10 2002 D 8 1 4