I have a tibble with the following columns with over a million rows:
Emp_ID Start_Date Consec_Days
<int> <date> <dbl>
1 4 2018-02-19 NA
2 4 2018-02-20 1
3 4 2018-02-21 1
4 3 2016-12-26 NA
5 3 2017-01-02 0
6 3 2017-01-03 1
7 3 2017-01-04 1
8 3 2017-01-06 0
9 3 2017-01-07 1
10 3 2017-01-09 0
Start_Date is the day an employee goes to work. I'd like to create a column that calculates the average number of consecutive days worked for each unique employee ID. Each NA value denotes the very first day a particular employee went to work. The problem is I don't know how to account for the NA values and 0's in the Consec_Days column to get an average of consecutive days.
I've tried referencing the following posts but haven't been able to make it work on my end:
How to find mean of n consecutive days in each group r
Any help would be greatly appreciated!
CodePudding user response:
Something like this?
df %>%
group_by(Emp_ID) %>%
mutate(average_cons_days = sum(Consec_Days, na.rm = TRUE)/n())
Emp_ID Start_Date Consec_Days average_cons_days
<int> <chr> <int> <dbl>
1 4 2018-02-19 NA 0.667
2 4 2018-02-20 1 0.667
3 4 2018-02-21 1 0.667
4 3 2016-12-26 NA 0.429
5 3 2017-01-02 0 0.429
6 3 2017-01-03 1 0.429
7 3 2017-01-04 1 0.429
8 3 2017-01-06 0 0.429
9 3 2017-01-07 1 0.429
10 3 2017-01-09 0 0.429