Creating a binary variable if individual was observed in the previous year

Time:11-22

Let's say I have an example dataframe in the following format:

df <- data.frame( c(1,2,3,1,2,3,1,2,3),
                  c(3,3,3,2,2,2,1,1,1),
                  c(23,23,34,134,134,NA,45,NA,NA)
)
colnames(df) <- c("id", "year", "fte_wage")

df <- df[is.na(df$fte_wage) == FALSE,]

I want to create a binary variable (let's say, a column named "obs") indicating whether or not the individual was observed in the previous year. I have tried the following:

library(dplyr)
df2 <- 
  df %>% 
  arrange(id, year) %>%
  group_by(id) %>% 
  rowwise() %>%
  mutate(obs = ifelse((lag(year) %in% df[df$id == id,]$year & year > lag(year)), 1, 0))

This generates a column containing only 0 values. If I remove the second condition the code works, but then lag(year) is misapplied, because it picks up values from different individuals as well.

My desired output would be a dataframe in the following format:

id year fte_wage obs
1 1 23 0
1 2 23 1
1 3 43 1
2 1 54 0
2 2 32 1
3 1 56 0

CodePudding user response:

You can just group_by(id) and then check whether row_number() is > 1, i.e. whether the row is the first observation for that individual or a later one.

library(tidyverse)

df <- data.frame("id" = c(1,2,3,1,2,3,1,2,3),
                 "year" = c(3,3,3,2,2,2,1,1,1),
                 "fte_wage" = c(23,23,34,134,134,NA,45,NA,NA))

df %>% 
  drop_na(fte_wage) %>% 
  arrange(id, year) %>%
  group_by(id) %>% 
  mutate(obs = as.numeric(row_number() > 1))
#> # A tibble: 6 × 4
#> # Groups:   id [3]
#>      id  year fte_wage   obs
#>   <dbl> <dbl>    <dbl> <dbl>
#> 1     1     1       45     0
#> 2     1     2      134     1
#> 3     1     3       23     1
#> 4     2     2      134     0
#> 5     2     3       23     1
#> 6     3     3       34     0

Created on 2022-11-21 with reprex v2.0.2
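
One caveat: row_number() > 1 flags any earlier observation of the same individual, which matches "observed in the previous year" only when each individual's years have no gaps. The sketch below is one possible way to test for the previous year explicitly, by checking whether year - 1 appears among the same individual's years. The data frame gap_df and the column names obs_rownum and obs_prev are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical panel with a gap: individual 1 is not observed in year 2
gap_df <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  year = c(1, 3, 4, 2, 3)
)

gap_res <- gap_df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(
    # 1 for every row after the individual's first observation
    obs_rownum = as.numeric(row_number() > 1),
    # 1 only if the same individual also has a row for year - 1
    obs_prev   = as.numeric((year - 1) %in% year)
  ) %>%
  ungroup()

gap_res
```

For individual 1 in year 3 the two columns disagree: obs_rownum is 1 because an earlier row exists, while obs_prev is 0 because year 2 is missing. On the data in the question, which has no gaps, both give the same result.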

CodePudding user response:

This is one approach using dplyr without grouping.

library(dplyr)

df %>% 
  na.omit() %>% 
  arrange(id, year) %>% 
  mutate(obs = (lag(id, default = FALSE) == id) * 1)

Output:

  id year fte_wage obs
1  1    1       45   0
2  1    2      134   1
3  1    3       23   1
4  2    2      134   0
5  2    3       23   1
6  3    3       34   0
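
As a side note on why the attempt in the question returns only zeros: calling rowwise() after group_by(id) turns every row into its own group, so lag(year) is evaluated on a single value and is always NA. A minimal sketch of this, together with a grouped lag(year) variant, is shown below. The names rowwise_lag, prev_year and res are illustrative only.

```r
library(dplyr)

df <- data.frame(
  id       = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  year     = c(3, 3, 3, 2, 2, 2, 1, 1, 1),
  fte_wage = c(23, 23, 34, 134, 134, NA, 45, NA, NA)
)
df <- df[!is.na(df$fte_wage), ]

# Under rowwise(), each row is processed alone, so lag() has no earlier value
rowwise_lag <- df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  rowwise() %>%
  mutate(prev_year = lag(year)) %>%
  ungroup()

# Grouping by id only: lag(year) stays within each individual
res <- df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(obs = as.numeric(!is.na(lag(year)) & year - lag(year) == 1)) %>%
  ungroup()

res
```

Here prev_year is NA in every row of rowwise_lag, whereas the grouped version compares each year with the same individual's preceding row and treats the first row of each individual as not previously observed.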

CodePudding user response:

You could use diff in the following way:

library(dplyr)

df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(obs = as.numeric(c(0, diff(year)) == 1))

Output:

# A tibble: 6 x 4
# Groups:   id [3]
     id  year fte_wage   obs
  <dbl> <dbl>    <dbl> <dbl>
1     1     1       45     0
2     1     2      134     1
3     1     3       23     1
4     2     2      134     0
5     2     3       23     1
6     3     3       34     0
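
To see what the diff expression does on its own, here is a small sketch on two plain vectors of years for a single individual (both vectors are made-up examples and must already be sorted in ascending order):

```r
# Years for one individual without a gap, and with year 2 missing
years_no_gap <- c(1, 2, 3)
years_gap    <- c(1, 3, 4)

# diff() returns the differences between consecutive elements;
# the leading 0 pads the first observation, which has no predecessor
obs_no_gap <- as.numeric(c(0, diff(years_no_gap)) == 1)
obs_gap    <- as.numeric(c(0, diff(years_gap)) == 1)

obs_no_gap
obs_gap
```

The first vector gives 0 1 1 and the second gives 0 0 1, so an observation is flagged only when the preceding row is exactly one year earlier.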