My data frame comprises 3 columns: a grouping factor, Current_Date, and Start_Date (by definition, Current_Date ≥ Start_Date; date format is dmy), with multiple duplicates of dates in each column and a lag of varying length between them. Some dates overlap between groups, but some don't.
The real data is hundreds of thousands of rows long, so my problem is finding an efficient way to assign to each row the number of unique dates from the entire Current_Date column (within the row's group) that fall inside the seq of dates defined by the time window (Start_Date to Current_Date), which is specific to each row.
A dummy data set is presented below, with the desired result in the added Dates_in_range column, but without the grouping factor, which I don't know how to handle in for-loop format (e.g., there is only one unique date in the window between 21-10-22 and 21-10-22, but three unique dates in the window between 21-10-22 and 28-10-22):
Current_Date Start_Date Dates_in_range
1 21-10-22 21-10-22 1
2 26-10-22 26-10-22 1
3 26-10-22 21-10-22 2
4 26-10-22 26-10-22 1
5 26-10-22 21-10-22 2
6 28-10-22 26-10-22 2
7 28-10-22 28-10-22 1
8 28-10-22 21-10-22 3
My solution is based on creating two types of lists containing dates, by using for-loops, and adding each type as a temporary column to the data table:
the first type is identical lists of all dates shared by the entire data set (or a group within it), repeated over all rows in the data table (or a group within it);
the second type is row-specific lists, derived from the time window specified by Current_Date and Start_Date.
I then find the intersect between the two list columns for each row, applying another for-loop.
Reproducible code is attached below:
library(data.table)
## Load the data set
dt = data.table(Current_Date= c("21-10-22","26-10-22","26-10-22","26-10-22","26-10-22","28-10-22","28-10-22","28-10-22"),
Start_Date = c("21-10-22","26-10-22","21-10-22","26-10-22","21-10-22","26-10-22","28-10-22","21-10-22"))
# Parse the day-month-year strings into Date objects
library(lubridate)
dt$Current_Date<- dmy(dt$Current_Date)
dt$Start_Date <- dmy(dt$Start_Date)
## Create a list of all current dates within the data set (= Current_Date column)
Dates_all <- as.list(dt$Current_Date)
# Add the list as a Dates_all column to the data set
dt$All_dates <- list()
for (i in 1:length(dt[, Current_Date])){
dt$All_dates[[i]] <- Dates_all
}
## Create a list of sequences of all possible dates within the date period (from Start_Date to Current_Date) for each row
Date_window <- list()
for (i in 1:length(dt[, Current_Date])){
Date_window[[i]] <- as.list(seq(as.Date(dt[i, Start_Date]), as.Date(dt[i, Current_Date]), by="days"))
}
# Add the list as a Date_window column to the data set
dt$Date_window <- Date_window
## Add the Dates_in_range column containing the number of dates from Current_Date column, occurring in the row-specific time window
for (i in 1:length(dt[, Current_Date])){
dt$Dates_in_range[[i]] <- length(intersect(dt$Date_window[[i]], dt$All_dates[[i]]))
}
# Cleanup & print
dt[, c("Date_window","All_dates") := NULL]
rm(Dates_all, Date_window, i)
print(dt)
I suspect it can be accomplished using the foverlaps function, but I am not sure how to apply it in this case.
Thanks in advance!
CodePudding user response:
Here's another approach that should work, still using a for loop. Basically, we start off by getting a vector of all unique dates, then define a function that counts how many of those dates fall within a range defined by a minimum and a maximum, then loop over the data set to apply that function to each row. Of course, we could use vapply() or something similar here if you prefer to avoid the explicit loop.
# get unique dates from the Current_Date column
dates <- unique(dt$Current_Date)
# function to count how many dates fall in the range [mn, mx]
n_in_range <- function(d, mn, mx) {
  sum(d <= mx & d >= mn)
}
# for loop over the rows
dt$Dates_in_range <- NA_integer_
for (i in seq_len(nrow(dt))) {
  dt$Dates_in_range[i] <- n_in_range(dates, dt$Start_Date[i], dt$Current_Date[i])
}
dt
Current_Date Start_Date Dates_in_range
1: 2022-10-21 2022-10-21 1
2: 2022-10-26 2022-10-26 1
3: 2022-10-26 2022-10-21 2
4: 2022-10-26 2022-10-26 1
5: 2022-10-26 2022-10-21 2
6: 2022-10-28 2022-10-26 2
7: 2022-10-28 2022-10-28 1
8: 2022-10-28 2022-10-21 3
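For what it's worth, the vapply() variant mentioned above might look like the sketch below. It rebuilds the dummy data from the question in a plain data.frame so it runs on its own; the name df is my own choice, and the dates are parsed with base as.Date() instead of lubridate, which is an assumption made only to keep the example dependency-free.

```r
# Dummy data from the question, parsed as Date (day-month-two-digit-year)
df <- data.frame(
  Current_Date = as.Date(c("21-10-22", "26-10-22", "26-10-22", "26-10-22",
                           "26-10-22", "28-10-22", "28-10-22", "28-10-22"),
                         format = "%d-%m-%y"),
  Start_Date   = as.Date(c("21-10-22", "26-10-22", "21-10-22", "26-10-22",
                           "21-10-22", "26-10-22", "28-10-22", "21-10-22"),
                         format = "%d-%m-%y")
)

# Unique dates to count against
dates <- unique(df$Current_Date)

# Count how many dates in d fall inside the closed range [mn, mx]
n_in_range <- function(d, mn, mx) sum(d >= mn & d <= mx)

# vapply() replaces the explicit loop and guarantees an integer result per row
df$Dates_in_range <- vapply(
  seq_len(nrow(df)),
  function(i) n_in_range(dates, df$Start_Date[i], df$Current_Date[i]),
  integer(1)
)

df
```

This is still one function call per row, so it is tidier than the for loop but not necessarily much faster.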
CodePudding user response:
With data.table, you can do the following:
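library(data.table)
dt = data.table(Current_Date= c("21-10-22","26-10-22","26-10-22","26-10-22","26-10-22","28-10-22","28-10-22","28-10-22"),
Start_Date = c("21-10-22","26-10-22","21-10-22","26-10-22","21-10-22","26-10-22","28-10-22","21-10-22"))
# Parse the dmy strings as dates first; between() on the raw strings would compare them lexicographically
dt[, c("Current_Date", "Start_Date") := lapply(.SD, as.IDate, format = "%d-%m-%y"),
.SDcols = c("Current_Date", "Start_Date")]
# For each row, count the unique Current_Date values inside [Start_Date, Current_Date]
dt[,Dates_in_range := sum(between(dt[,unique(Current_Date)], Start_Date, Current_Date)),
by=seq_len(nrow(dt))]
dt
#> Current_Date Start_Date Dates_in_range
#> 1: 2022-10-21 2022-10-21 1
#> 2: 2022-10-26 2022-10-26 1
#> 3: 2022-10-26 2022-10-21 2
#> 4: 2022-10-26 2022-10-26 1
#> 5: 2022-10-26 2022-10-21 2
#> 6: 2022-10-28 2022-10-26 2
#> 7: 2022-10-28 2022-10-28 1
#> 8: 2022-10-28 2022-10-21 3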
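The question also mentions a grouping factor that the dummy data leaves out. Below is a minimal sketch of one way to handle it, assuming the grouping column is called Group (a hypothetical name; the toy values are made up for illustration as well). The idea is simply to take the unique dates inside each group and count against those.

```r
library(data.table)

# Hypothetical toy data: 'Group' is an assumed name for the grouping factor
dt <- data.table(
  Group        = c("A", "A", "A", "B", "B"),
  Current_Date = as.IDate(c("2022-10-21", "2022-10-26", "2022-10-28",
                            "2022-10-26", "2022-10-28")),
  Start_Date   = as.IDate(c("2022-10-21", "2022-10-21", "2022-10-21",
                            "2022-10-26", "2022-10-21"))
)

dt[, Dates_in_range := {
  u <- unique(Current_Date)  # dates that occur in this group only
  # for each row of the group, count the group's dates inside its window
  vapply(seq_len(.N),
         function(i) sum(u >= Start_Date[i] & u <= Current_Date[i]),
         integer(1))
}, by = Group]

dt
```

In group B the last window starts on 2022-10-21, but that date only occurs in group A, so the count there is 2 and not 3.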
CodePudding user response:
Using sapply (this assumes the date columns have already been parsed as dates, e.g. with dmy()):
dt[, n := sapply(Start_Date, function(x, y) sum(x <= y), y = unique(Start_Date)), by = Current_Date]
or using frank:
dt[, n := frank(1/as.integer(Start_Date), ties.method = "dense"), by = Current_Date]
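If the per-row approaches turn out to be too slow on hundreds of thousands of rows, one loop-free alternative is base R's findInterval(). This is a different technique from the answers above, offered as a sketch only: with u the sorted unique dates, the number of dates in [Start_Date, Current_Date] equals the count of u <= Current_Date minus the count of u < Start_Date. The variable names cur, start, u and n are my own.

```r
# Dummy data from the question, parsed as Date
cur <- as.Date(c("21-10-22", "26-10-22", "26-10-22", "26-10-22",
                 "26-10-22", "28-10-22", "28-10-22", "28-10-22"),
               format = "%d-%m-%y")
start <- as.Date(c("21-10-22", "26-10-22", "21-10-22", "26-10-22",
                   "21-10-22", "26-10-22", "28-10-22", "21-10-22"),
                 format = "%d-%m-%y")

# Sorted unique dates as plain numbers (findInterval needs a sorted numeric vector)
u <- sort(unique(as.numeric(cur)))

# findInterval(x, u)                   -> how many u are <= x
# findInterval(x, u, left.open = TRUE) -> how many u are <  x
n <- findInterval(as.numeric(cur), u) -
  findInterval(as.numeric(start), u, left.open = TRUE)

n
```

Both findInterval() calls work on whole columns at once, so there is no per-row function call. For grouped data the same two lines could presumably be run inside each group.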