Good evening,
I have a very large data set in R and I'm trying to find the best way to loop through it to solve a few problems. Imagine the data as historical employee work hours. It looks something like:
rawTable:
Department Name Date Hours
Engineering Mary 2021-01-01 8
Engineering Mary 2021-01-02 8
Engineering Mary 2021-01-03 0
Engineering Mary 2021-01-04 6
Sales Barry 2021-01-01 0
Sales Barry 2021-01-02 12
Sales Barry 2021-01-03 12
Sales Barry 2021-01-04 12
I have about 3,200 people on the list, and a row for every day of the year, so that table is obviously very large.
I need to add two columns to the table:
The first is LDO that shows (for each day) their last day off
The second is WSH shows how many hours that person has worked since their last day off. that will look like:
rawTable:
Department Name Date Hours LDO WSH
Engineering Mary 2021-01-01 8 2020-12-31 8
Engineering Mary 2021-01-02 8 2020-12-31 16
Engineering Mary 2021-01-03 0 2021-01-03 0
Engineering Mary 2021-01-04 6 2021-01-03 6
Sales Barry 2021-01-01 0 2021-01-01 0
Sales Barry 2021-01-02 12 2021-01-01 12
Sales Barry 2021-01-03 12 2021-01-01 24
Sales Barry 2021-01-04 12 2021-01-01 36
I've tried using a for loop to have it apply the logic row-by-row. For each row, if the number of hours equals zero, then LDO=Date and WSH=0. If not, then LDO=LDO from previous row and WSH=WSH from previous hours. With this size set it takes forever and a half to run.
Next I created a function that, given a row, uses a copy of the big list and based on a "which" statement will tell me the row number of the last day that person worked 0 hours prior to the row date. This took forever and a half too. Along with that, I didn't even get to the WSH portion. That looked like:
rawLU <- rawTable
LDO = function(x) {
max(c(0, which((rawLU$Name == x["Name"]) &
(rawLU$Hours == 0) & (rawLU$Date <= x[Date])
)))
}
LastOff<-apply(rawTable,1,LDO)
I know there's a simpler way to do it, and I also know that I can't seem to figure it out.
Can anyone assist? Thanks in advance!
Mike
CodePudding user response:
Here's a possible solution with dplyr
-
Get the Date
value if the Hours = 0
, use fill
to get the previous non-working date on other rows. WSH
can be calculated using cumsum
.
library(dplyr)
library(tidyr)
rawTable %>%
mutate(Date = as.Date(Date)) %>%
group_by(Department, Name) %>%
mutate(LDO = if_else(Hours == 0, Date, as.Date(NA))) %>%
fill(LDO) %>%
mutate(LDO = if_else(is.na(LDO), min(Date) - 1, LDO)) %>%
group_by(LDO, .add = TRUE) %>%
mutate(WSH = cumsum(Hours)) %>%
ungroup
# Department Name Date Hours LDO WSH
# <chr> <chr> <date> <int> <date> <int>
#1 Engineering Mary 2021-01-01 8 2020-12-31 8
#2 Engineering Mary 2021-01-02 8 2020-12-31 16
#3 Engineering Mary 2021-01-03 0 2021-01-03 0
#4 Engineering Mary 2021-01-04 6 2021-01-03 6
#5 Sales Barry 2021-01-01 0 2021-01-01 0
#6 Sales Barry 2021-01-02 12 2021-01-01 12
#7 Sales Barry 2021-01-03 12 2021-01-01 24
#8 Sales Barry 2021-01-04 12 2021-01-01 36
data
rawTable <- structure(list(Department = c("Engineering", "Engineering", "Engineering",
"Engineering", "Sales", "Sales", "Sales", "Sales"), Name = c("Mary",
"Mary", "Mary", "Mary", "Barry", "Barry", "Barry", "Barry"),
Date = c("2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
"2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"),
Hours = c(8L, 8L, 0L, 6L, 0L, 12L, 12L, 12L)), class = "data.frame", row.names = c(NA, -8L))
CodePudding user response:
df1 %>%
group_by(Department, Name, grp = cumsum(Hours==0)) %>%
mutate(Date = as.Date(Date),
LDO = first(Date) - (first(Hours)>0),
WHS = cumsum(Hours))
# A tibble: 8 x 7
# Groups: Department, Name, grp [3]
Department Name Date Hours grp LDO WHS
<chr> <chr> <date> <int> <int> <date> <int>
1 Engineering Mary 2021-01-01 8 0 2020-12-31 8
2 Engineering Mary 2021-01-02 8 0 2020-12-31 16
3 Engineering Mary 2021-01-03 0 1 2021-01-03 0
4 Engineering Mary 2021-01-04 6 1 2021-01-03 6
5 Sales Barry 2021-01-01 0 2 2021-01-01 0
6 Sales Barry 2021-01-02 12 2 2021-01-01 12
7 Sales Barry 2021-01-03 12 2 2021-01-01 24
8 Sales Barry 2021-01-04 12 2 2021-01-01 36