Home > Software design >  R How can I find the most recent row with a certain value
R How can I find the most recent row with a certain value

Time:10-13

Good evening,

I have a very large data set in R and I'm trying to find the best way to loop through it to solve a few problems. Imagine the data as historical employee work hours. It looks something like:

rawTable:

Department      Name      Date         Hours

Engineering     Mary      2021-01-01   8
Engineering     Mary      2021-01-02   8
Engineering     Mary      2021-01-03   0
Engineering     Mary      2021-01-04   6
Sales           Barry     2021-01-01   0
Sales           Barry     2021-01-02   12
Sales           Barry     2021-01-03   12
Sales           Barry     2021-01-04   12    

I have about 3,200 people on the list, and a row for every day of the year, so that table is obviously very large.

I need to add two columns to the table:

The first is LDO that shows (for each day) their last day off

The second is WSH shows how many hours that person has worked since their last day off. that will look like:

rawTable:

Department      Name      Date         Hours  LDO          WSH

Engineering     Mary      2021-01-01   8      2020-12-31   8
Engineering     Mary      2021-01-02   8      2020-12-31   16
Engineering     Mary      2021-01-03   0      2021-01-03   0
Engineering     Mary      2021-01-04   6      2021-01-03   6
Sales           Barry     2021-01-01   0      2021-01-01   0
Sales           Barry     2021-01-02   12     2021-01-01   12
Sales           Barry     2021-01-03   12     2021-01-01   24
Sales           Barry     2021-01-04   12     2021-01-01   36

I've tried using a for loop to have it apply the logic row-by-row. For each row, if the number of hours equals zero, then LDO=Date and WSH=0. If not, then LDO=LDO from previous row and WSH=WSH from previous hours. With this size set it takes forever and a half to run.

Next I created a function that, given a row, uses a copy of the big list and based on a "which" statement will tell me the row number of the last day that person worked 0 hours prior to the row date. This took forever and a half too. Along with that, I didn't even get to the WSH portion. That looked like:

rawLU <- rawTable

LDO = function(x) {
  max(c(0, which((rawLU$Name == x["Name"]) &
                   (rawLU$Hours == 0) & (rawLU$Date <= x[Date])
  )))
}

LastOff<-apply(rawTable,1,LDO)

I know there's a simpler way to do it, and I also know that I can't seem to figure it out.

Can anyone assist? Thanks in advance!

Mike

CodePudding user response:

Here's a possible solution with dplyr -

Get the Date value if the Hours = 0, use fill to get the previous non-working date on other rows. WSH can be calculated using cumsum.

library(dplyr)
library(tidyr)

rawTable %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Department, Name) %>%
  mutate(LDO = if_else(Hours == 0, Date, as.Date(NA))) %>%
  fill(LDO) %>%
  mutate(LDO = if_else(is.na(LDO), min(Date) - 1, LDO)) %>%
  group_by(LDO, .add = TRUE) %>%
  mutate(WSH = cumsum(Hours)) %>%
  ungroup

#  Department  Name  Date       Hours LDO          WSH
#  <chr>       <chr> <date>     <int> <date>     <int>
#1 Engineering Mary  2021-01-01     8 2020-12-31     8
#2 Engineering Mary  2021-01-02     8 2020-12-31    16
#3 Engineering Mary  2021-01-03     0 2021-01-03     0
#4 Engineering Mary  2021-01-04     6 2021-01-03     6
#5 Sales       Barry 2021-01-01     0 2021-01-01     0
#6 Sales       Barry 2021-01-02    12 2021-01-01    12
#7 Sales       Barry 2021-01-03    12 2021-01-01    24
#8 Sales       Barry 2021-01-04    12 2021-01-01    36

data

rawTable <- structure(list(Department = c("Engineering", "Engineering", "Engineering", 
"Engineering", "Sales", "Sales", "Sales", "Sales"), Name = c("Mary", 
"Mary", "Mary", "Mary", "Barry", "Barry", "Barry", "Barry"), 
    Date = c("2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04", 
    "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"), 
    Hours = c(8L, 8L, 0L, 6L, 0L, 12L, 12L, 12L)), class = "data.frame", row.names = c(NA, -8L))

CodePudding user response:

df1 %>%
   group_by(Department, Name, grp = cumsum(Hours==0)) %>%
   mutate(Date = as.Date(Date),
      LDO = first(Date) - (first(Hours)>0),
      WHS = cumsum(Hours))

# A tibble: 8 x 7
# Groups:   Department, Name, grp [3]
  Department  Name  Date       Hours   grp LDO          WHS
  <chr>       <chr> <date>     <int> <int> <date>     <int>
1 Engineering Mary  2021-01-01     8     0 2020-12-31     8
2 Engineering Mary  2021-01-02     8     0 2020-12-31    16
3 Engineering Mary  2021-01-03     0     1 2021-01-03     0
4 Engineering Mary  2021-01-04     6     1 2021-01-03     6
5 Sales       Barry 2021-01-01     0     2 2021-01-01     0
6 Sales       Barry 2021-01-02    12     2 2021-01-01    12
7 Sales       Barry 2021-01-03    12     2 2021-01-01    24
8 Sales       Barry 2021-01-04    12     2 2021-01-01    36
  • Related