Let's assume that this data frame logs registrations for persons (ID). Each person has a number of records, each starting on START_DT and ending on END_DT. For each person in the data frame, I need to calculate the date on which they reach 365 cumulative days of registration. Some people will never reach a year, others may need multiple records before they reach 365 days, and others may achieve this on their very first record (like ID = 1502).
# DATA GENERATION
#set seed and generate a list of 10,000 random dates
set.seed(42)
START_DT <- sample(seq(as.Date('1910/01/01'), as.Date('2000/01/01'), by="day"), 10000)
#generate a list of random numbers of days to add to the start dates
DIFF_DAYS <- floor(runif(10000, min=1, max=100))
#add the two
END_DT <- START_DT + DIFF_DAYS
#generate person IDs
set.seed(42)
ID <- floor(runif(10000, min=1, max=1500))
#make into a dataframe
df <- data.frame(ID, START_DT, DIFF_DAYS, END_DT)
#generate an additional row to ensure that at least one record surpasses 365 days in the first record.
row <- list(1502, as.Date('1910/01/01'), 2102, as.Date('1915-10-04'))
df <- rbind(df, row)
The main issue is that I have 100 billion rows, so it needs to be fast enough to churn through that.
My approach so far has been to
- create a column with the cumulative count of days (for each person)
- create another column that is 0 where the cumulative count is < 365 and 1 otherwise, indicating the record in which the person reaches 365 days
- loop through, subsetting the main dataset by ID, and then calculate the result with an if/else statement (a rough sketch of this is shown below)
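In rough code, what I've been doing looks something like this (a simplified sketch, not my exact code; column names like CUM_DAYS and DATE_365 are just illustrative):
# cumulative days per person (rows sorted chronologically within each ID)
df_loop <- df[order(df$ID, df$START_DT), ]
df_loop$CUM_DAYS <- ave(df_loop$DIFF_DAYS, df_loop$ID, FUN = cumsum)
# flag the records where the running total has reached 365 days
df_loop$OVER_365 <- ifelse(df_loop$CUM_DAYS < 365, 0, 1)
# slow part: loop over every person, subset, and pick the first flagged record
result <- NULL
for (id in unique(df_loop$ID)) {
  person <- df_loop[df_loop$ID == id, ]
  hit <- which(person$OVER_365 == 1)
  if (length(hit) > 0) {
    first <- person[hit[1], ]
    result <- rbind(result,
                    data.frame(ID = id,
                               DATE_365 = first$END_DT - (first$CUM_DAYS - 365)))
  }
}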
I figured that this is probably a slow way of doing it... does anybody have an interesting way of solving this so that it can be done as quickly as possible, please?
Very grateful for any ideas as to how I can approach this problem. Many thanks in advance :)
CodePudding user response:
data.table will probably be the fastest solution, provided that you are not using a database. It will in particular be faster than dtplyr, since dtplyr has to do the extra work of converting dplyr syntax into data.table syntax.
That being said, below is a fairly unoptimized way of accomplishing this within the data.table framework, but it is otherwise pretty fast. There are almost certainly opportunities to use keys to improve speed through binary search (a small sketch of setting a key follows the output below), but I'm not certain that, e.g., finding the values above some threshold within groups is faster with a binary search than across the pooled values.
library(data.table)
# Assume that we do NOT have the "difference" variable
df$DIFF_DAYS <- NULL
# Convert to data.table
dt <- as.data.table(df)
# Add diff = end - start date
dt[, diff := as.numeric(difftime(END_DT, START_DT, units = "days"))]
# Add cumulative total per ID
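# (note: this assumes the records are already in chronological order within each
#  ID; if they might not be, run setorder(dt, ID, START_DT) first)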
dt[, csum := cumsum(diff), by = ID]
# Keep only rows where csum >= 365
dt <- dt[dt[, .I[csum >= 365], by = ID]$V1]
# Keep only minimum csum row per ID
dt <- dt[dt[, .I[csum == min(csum)], by = ID]$V1]
# Calculate anniversary
dt[, anniversary := END_DT - (csum - 365)]
print(dt)
#> ID START_DT END_DT diff csum anniversary
#> 1: 1405 1913-12-04 1914-02-06 64 398 1914-01-04
#> 2: 1245 1928-03-15 1928-06-12 89 412 1928-04-26
#> 3: 962 1927-01-11 1927-03-22 70 373 1927-03-14
#> 4: 779 1932-05-09 1932-07-04 56 401 1932-05-29
#> 5: 1105 1993-07-31 1993-10-12 73 389 1993-09-18
#> ---
#> 585: 346 1968-06-07 1968-07-18 41 370 1968-07-13
#> 586: 882 1950-08-31 1950-11-15 76 377 1950-11-03
#> 587: 1490 1975-06-30 1975-10-02 94 394 1975-09-03
#> 588: 958 1926-10-13 1926-12-11 59 384 1926-11-22
#> 589: 1502 1910-01-01 1915-10-04 2102 2102 1911-01-01
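As a rough illustration of the keying idea mentioned above (just a sketch; whether it actually speeds this up would need benchmarking), a key could be set right after converting to a data.table, before the steps above:
# Sketch: key the table by ID and start date (sorts dt by reference)
dt <- as.data.table(df)
setkey(dt, ID, START_DT)
key(dt)       # "ID" "START_DT"
dt[.(1502)]   # keyed lookup of a single ID uses binary search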
CodePudding user response:
In the absence of an example of your expected result, this is one approach that I think will allow you to solve the problem.
In the future, please include your expected output in the question.
The logic I applied here is to calculate the cumulative days per ID, and to filter to the first occurrence where cumdays reaches 365 for each ID (note the use of duplicated inside the filter to keep first occurrences only).
When this is done, simply subtract the difference between cumdays and 365 from the end date to find the exact date at which the ID exceeds 365 days for the first time.
classical dplyr
library(dplyr)
library(lubridate)
df <- as_tibble(df)
# first let's see when cumulative days are above 365 for the first time for each ID
res1 <- df %>%
  group_by(ID) %>%
  arrange(ID, START_DT) %>%
  mutate(cumdays = cumsum(DIFF_DAYS)) %>%
  # keep the first row per ID where the running total reaches 365 days
  filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
  # now, let's calculate the date at which cumdays reach 365
  mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365)
res1
# A tibble: 589 × 7
# Groups: ID [589]
ID START_DT DIFF_DAYS END_DT cumdays days_since_365 date365
<dbl> <date> <dbl> <date> <dbl> <dbl> <date>
1 1 1955-10-12 80 1955-12-31 434 69 1955-10-23
2 3 1981-11-25 87 1982-02-20 393 28 1982-01-23
3 6 1979-09-07 88 1979-12-04 411 46 1979-10-19
4 8 1994-05-08 31 1994-06-08 386 21 1994-05-18
5 9 1986-01-15 79 1986-04-04 404 39 1986-02-24
6 11 1952-08-08 80 1952-10-27 370 5 1952-10-22
Since you have many rows, using dtplyr will speed up the process.
library(dtplyr)
df2 <- lazy_dt(df)
res2 <- df2 %>% group_by(ID) %>% arrange(ID, START_DT) %>%
  mutate(cumdays = cumsum(DIFF_DAYS)) %>%
  filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
  mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365)
res2 <- as_tibble(res2)
identical(res2, res1) # identical returns FALSE because of a difference in class
class(res1)
class(res2)
(res2==res1) %>% all # TRUE, the figures are all equal
benchmarking
dtplyr turns out to be about 30 times faster than plain dplyr here, and if you are currently looping over IDs, even plain dplyr should already be a big improvement.
library(microbenchmark)
df3 <- df
df3$DIFF_DAYS <- NULL
dt <- as.data.table(df3)
microbenchmark(
  dplyr = df %>% group_by(ID) %>% arrange(ID, START_DT) %>% mutate(cumdays = cumsum(DIFF_DAYS)) %>%
    filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
    mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365),
  dtplyr = df2 %>% group_by(ID) %>% arrange(ID, START_DT) %>% mutate(cumdays = cumsum(DIFF_DAYS)) %>%
    filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
    mutate(days_since_365 = cumdays - 365, date365 = END_DT - days_since_365),
  pure_dt = dt[, diff := as.numeric(difftime(END_DT, START_DT, units = "days"))][
    , csum := cumsum(diff), by = ID][dt[, .I[csum == min(csum)], by = ID]$V1][
    , anniversary := END_DT - (csum - 365)]
)
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 28.7915 29.84825 31.681433 30.29995 33.36750 43.5997 100
dtplyr 1.0071 1.11600 1.237315 1.23845 1.34355 1.6474 100
pure_dt 1.4610 1.61920 1.878730 1.78385 1.90615 6.6077 100
This being said, the amount of data that can be processed on a single machine will be limited by memory.
I included @socialscientist's pure_dt approach in this benchmark, and dtplyr is still faster, which suggests that the data.table code the package generates is more efficient than that particular hand-written implementation.
Although I like pure data.table, the advantage of dtplyr is that you can use a syntax that is clearer, if somewhat less terse, while dtplyr seems to guarantee a certain level of backend efficiency.
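If you want to see what dtplyr actually does behind the scenes, calling show_query() on the lazy pipeline prints the data.table code it generates, without running the computation (a quick illustration using df2 from above; the exact translation may differ between dtplyr versions):
# Print the data.table translation dtplyr generates for this pipeline
df2 %>%
  group_by(ID) %>%
  arrange(ID, START_DT) %>%
  mutate(cumdays = cumsum(DIFF_DAYS)) %>%
  filter(cumdays >= 365 & !duplicated(cumdays >= 365)) %>%
  show_query()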