Let's assume that this data frame logs registrations for persons (ID). Each person has a number of records, each starting on START_DT and ending on END_DT. For each person in the data frame, I need to calculate the date on which they reach 365 cumulative days of registration. Some people will never reach a year, others may need multiple records before they reach 365 days, and others may achieve this on their very first record (like ID = 1502).
# DATA GENERATION
#set seed and generate a list of 10,000 random dates
set.seed(42)
START_DT <- sample(seq(as.Date('1910/01/01'), as.Date('2000/01/01'), by="day"), 10000)
#generate a list of random numbers of days to add to the start dates
DIFF_DAYS <- floor(runif(10000, min=1, max=100))
#add the two
END_DT <- START_DT + DIFF_DAYS
#generate person IDs
set.seed(42)
ID <- floor(runif(10000, min=1, max=1500))
#make into a dataframe
df <- data.frame(ID, START_DT, DIFF_DAYS, END_DT)
#generate an additional row to ensure that at least one record surpasses 365 days in the first record.
row <- list(1502, as.Date('1910/01/01'), 2102, as.Date('1915-10-04'))
df <- rbind(df, row)
The main issue is that I have 100 billion rows, so it needs to be fast enough to churn through that.
My approach so far has been to
- create a column with the cumulative count of days (for each person)
- create another column that is 0 where the cumulative count is < 365 and 1 otherwise, indicating the record in which the person reaches 365 days
- loop through, subsetting the main dataset by ID, and then calculate the result with an if/else statement (a rough sketch of this is shown below)
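In rough code, what I've been doing looks something like this (a simplified sketch, not my exact code; column names like CUM_DAYS and DATE_365 are just illustrative):
# cumulative days per person (rows sorted chronologically within each ID)
df_loop <- df[order(df$ID, df$START_DT), ]
df_loop$CUM_DAYS <- ave(df_loop$DIFF_DAYS, df_loop$ID, FUN = cumsum)
# flag the records where the running total has reached 365 days
df_loop$OVER_365 <- ifelse(df_loop$CUM_DAYS < 365, 0, 1)
# slow part: loop over every person, subset, and pick the first flagged record
result <- NULL
for (id in unique(df_loop$ID)) {
  person <- df_loop[df_loop$ID == id, ]
  hit <- which(person$OVER_365 == 1)
  if (length(hit) > 0) {
    first <- person[hit[1], ]
    result <- rbind(result,
                    data.frame(ID = id,
                               DATE_365 = first$END_DT - (first$CUM_DAYS - 365)))
  }
}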
I figured that this is probably a slow way of doing it... does anybody have an interesting way of solving this so that it can be done as quickly as possible, please?
Very grateful for any ideas as to how I can approach this problem. Many thanks in advance :)
CodePudding user response:
data.table will probably be the fastest solution, provided that you are not using a database. It will in particular be faster than dtplyr, since dtplyr has to do the extra work of converting dplyr syntax into data.table syntax.
That being said, below is a fairly unoptimized way of accomplishing this within the data.table framework, but it is otherwise pretty fast. There are almost certainly opportunities to use keys to improve speed through binary search (a small sketch of setting a key follows the output below), but I'm not certain that, e.g., finding the values above some threshold within groups is faster with a binary search than across the pooled values.
library(data.table)
# Assume that we do NOT have the "difference" variable
df$DIFF_DAYS <- NULL
# Convert to data.table
dt <- as.data.table(df)
# Add diff = end - start date
dt[, diff := as.numeric(difftime(END_DT, START_DT, units = "days"))]
# Add cumulative total per ID
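# (note: this assumes the records are already in chronological order within each
#  ID; if they might not be, run setorder(dt, ID, START_DT) first)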
dt[, csum := cumsum(diff), by = ID]
# Keep only rows where csum >= 365
dt <- dt[dt[, .I[csum >= 365], by = ID]$V1]
# Keep only minimum csum row per ID
dt <- dt[dt[, .I[csum == min(csum)], by = ID]$V1]
# Calculate anniversary
dt[, anniversary := END_DT - (csum - 365)]
print(dt)
#> ID START_DT END_DT diff csum anniversary
#> 1: 1405 1913-12-04 1914-02-06 64 398 1914-01-04
#> 2: 1245 1928-03-15 1928-06-12 89 412 1928-04-26
#> 3: 962 1927-01-11 1927-03-22 70 373 1927-03-14
#> 4: 779 1932-05-09 1932-07-04 56 401 1932-05-29
#> 5: 1105 1993-07-31 1993-10-12 73 389 1993-09-18
#> ---
#> 585: 346 1968-06-07 1968-07-18 41 370 1968-07-13
#> 586: 882 1950-08-31 1950-11-15 76 377 1950-11-03
#> 587: 1490 1975-06-30 1975-10-02 94 394 1975-09-03
#> 588: 958 1926-10-13 1926-12-11 59 384 1926-11-22
#> 589: 1502 1910-01-01 1915-10-04 2102 2102 1911-01-01
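As a rough illustration of the keying idea mentioned above (just a sketch; whether it actually speeds this up would need benchmarking), a key could be set right after converting to a data.table, before the steps above:
# Sketch: key the table by ID and start date (sorts dt by reference)
dt <- as.data.table(df)
setkey(dt, ID, START_DT)
key(dt)       # "ID" "START_DT"
dt[.(1502)]   # keyed lookup of a single ID uses binary search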
CodePudding user response:
In the absence of an example of your expected result, this is one approach that I think will allow you to solve the problem.
In the future, please include your expected output in the question.
The logic I applied here is to calculate the cumulative days per ID, and to filter to the first occurrence where cumdays reaches 365 for each ID (note the use of duplicated inside the filter to keep first occurrences only).
When this is done, simply subtract the difference between cumdays and 365 from the end date to find the exact date at which the ID exceeds 365 days for the first time.
classical dplyr
library(dplyr)
library(lubridate)
df <- as_tibble(df)
# first let's see when cumulative days are above 365 for the first time for each ID
res1 <- df %>%
  group_by(ID) %>%
  arrange(ID, START_DT) %>%
  mutate(cumdays = cumsum(DIFF_DAYS)) %>%
  # keep the first row per ID where the running total reaches 365 days
  filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
  # now, let's calculate the date at which cumdays reach 365
  mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365)
res1
# A tibble: 589 × 7
# Groups: ID [589]
ID START_DT DIFF_DAYS END_DT cumdays days_since_365 date365
<dbl> <date> <dbl> <date> <dbl> <dbl> <date>
1 1 1955-10-12 80 1955-12-31 434 69 1955-10-23
2 3 1981-11-25 87 1982-02-20 393 28 1982-01-23
3 6 1979-09-07 88 1979-12-04 411 46 1979-10-19
4 8 1994-05-08 31 1994-06-08 386 21 1994-05-18
5 9 1986-01-15 79 1986-04-04 404 39 1986-02-24
6 11 1952-08-08 80 1952-10-27 370 5 1952-10-22
Since you have many rows, using dtplyr will speed up the process.
library(dtplyr)
df2 <- lazy_dt(df)
res2 <- df2 %>% group_by(ID) %>% arrange(ID, START_DT) %>%
  mutate(cumdays = cumsum(DIFF_DAYS)) %>%
  filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
  mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365)
res2 <- as_tibble(res2)
identical(res2, res1) # identical returns FALSE because of a difference in class
class(res1)
class(res2)
(res2==res1) %>% all # TRUE, the figures are all equal
benchmarking
dtplyr turns out to be about 30 times faster than plain dplyr here, and if you are currently looping over IDs, even plain dplyr should already be a big improvement.
library(microbenchmark)
df3 <- df
df3$DIFF_DAYS <- NULL
dt <- as.data.table(df3)
microbenchmark(
  dplyr = df %>% group_by(ID) %>% arrange(ID, START_DT) %>% mutate(cumdays = cumsum(DIFF_DAYS)) %>%
    filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
    mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365),
  dtplyr = df2 %>% group_by(ID) %>% arrange(ID, START_DT) %>% mutate(cumdays = cumsum(DIFF_DAYS)) %>%
    filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
    mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365),
  pure_dt = dt[, diff := as.numeric(difftime(END_DT, START_DT, units = "days"))][
    , csum := cumsum(diff), by = ID][dt[, .I[csum == min(csum)], by = ID]$V1][
    , anniversary := END_DT - (csum - 365)]
)
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 28.7915 29.84825 31.681433 30.29995 33.36750 43.5997 100
dtplyr 1.0071 1.11600 1.237315 1.23845 1.34355 1.6474 100
pure_dt 1.4610 1.61920 1.878730 1.78385 1.90615 6.6077 100
This being said, the amount of data that can be processed on a single machine will be limited by memory.
I included @socialscientist's pure_dt approach in this benchmark, and dtplyr is still faster, which suggests that the data.table code the package generates is more efficient than that particular hand-written implementation.
Although I like pure data.table, the advantage of dtplyr is that you can use a syntax that is clearer, if somewhat less terse, while dtplyr seems to guarantee a certain level of backend efficiency.
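If you want to see what dtplyr actually does behind the scenes, calling show_query() on the lazy pipeline prints the data.table code it generates, without running the computation (a quick illustration using df2 from above; the exact translation may differ between dtplyr versions):
# Print the data.table translation dtplyr generates for this pipeline
df2 %>%
  group_by(ID) %>%
  arrange(ID, START_DT) %>%
  mutate(cumdays = cumsum(DIFF_DAYS)) %>%
  filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
  show_query()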