Take differences of columns conditional on being associated to others (in R)-CodePudding

  id hire_month sep_month wage_jan wage_feb wage_mar wage_apr
1  1          2         3      740      780      780      780
2  1          4         0      890      890      890      890
3  2          3         5      550      550      550      550
4  2          5        10      890      250      250      400
  wage_may wage_jun wage_jul wage_aug wage_sep wage_oct wage_nov
1      780      780      780      780      780      780      780
2      890      890      890      890      890      790      250
3      550      550      550      550      550      550      550
4      500      890      600      750      890      300      300
  wage_dec
1      780
2      300
3      550
4      300

structure(list(id = c(1L, 1L, 2L, 2L), hire_month = c(2L, 4L, 
3L, 5L), sep_month = c(3L, 0L, 5L, 10L), wage_jan = c(740L, 890L, 
550L, 890L), wage_feb = c(780L, 890L, 550L, 250L), wage_mar = c(780L, 
890L, 550L, 250L), wage_apr = c(780L, 890L, 550L, 400L), wage_may = c(780L, 
890L, 550L, 500L), wage_jun = c(780L, 890L, 550L, 890L), wage_jul = c(780L, 
890L, 550L, 600L), wage_aug = c(780L, 890L, 550L, 750L), wage_sep = c(780L, 
890L, 550L, 890L), wage_oct = c(780L, 790L, 550L, 300L), wage_nov = c(780L, 
250L, 550L, 300L), wage_dec = c(780L, 300L, 550L, 300L)), class = "data.frame", row.names = c(NA, 
-4L))

I would like to take the differences between wages based on hire_month and lag(sep_month) -- separation month for the previous row and same id. For example, if lag(sep_month) is 4 (april) and hire_month is 7 (july), I would like to take the differences between wage_jul and wage_apr (wage_apr being the one on the previous row). So the wages I need to take differences between are actually in different rows, because each row in my dataset is a contract.

I have a large dataset, so I would like a way to automate this association.

CodePudding user response：

Updated

The OP clarified that the sep month value comes from the row above, which means that there are n-1 rows for each ID, where n is the number of rows for that ID. In the example above, there are two rows per ID, we will estimate the difference for the 2nd (and last) row only for each ID

data$diff = as_tibble(data) %>%
  mutate(sep_wage = apply(.,1,function(x) x[x[3] 3])) %>% 
  group_by(id) %>%
  mutate(sep_wage = lag(sep_wage)) %>% ungroup() %>% 
  apply(.,MARGIN = 1, function(x) x[x[2] 3] - x[16])

Output

  id hire_month sep_month wage_jan wage_feb wage_mar wage_apr wage_may wage_jun wage_jul wage_aug wage_sep wage_oct wage_nov wage_dec diff
1  1          2         3      740      780      780      780      780      780      780      780      780      780      780      780   NA
2  1          4         0      890      890      890      890      890      890      890      890      890      790      250      300  110
3  2          3         5      550      550      550      550      550      550      550      550      550      550      550      550   NA
4  2          5        10      890      250      250      400      500      890      600      750      890      300      300      300  -50