I am trying to analyze a large data set (>1.5M observations) to look for correlations between the timing and location of events. Unfortunately, I am encountering performance issues in creating my analysis data set.
Essentially, an event can occur at one of several facilities (A, B, and C), several sites at each facility (1-6), and at a range of dates (between January and June 1, 1990), and have a result value of 1 and 0. Sites are near one another, so an event at site 2 is likely to affect sites 1, 2, and 3. Events have a persistent effect as well, so an event on March 10 might affect future events through March 17 or so.
Here is some example data:
set.seed(12345)
df <- data.frame(
facility = sample(
c("A","B","C"),
100,
replace=TRUE),
site = sample(
1:6,
100,
replace=TRUE),
date = as.Date(
sample(
c(lubridate::ymd("1990-1-1"):lubridate::ymd("1990-6-1")),
100,
replace=TRUE),
origin = "1970-01-01"
),
outcome = sample(
c(0,1),100,
replace=TRUE),
stringsAsFactors = FALSE
)
So far I have managed to get something that works through each iteration:
# A place to put the output
outputdf <- data.frame(
facility = character(),
site = numeric(),
date = as.Date(character()),
outcome = numeric(),
recent_success = integer(),
recent_failures = integer(),
stringsAsFactors = FALSE
)
# Loop through each iteration
for(i in 1:nrow(df)){
# Let me know how things are going in the console
print(paste("Event ",i," of ",nrow(df),sep=""))
#Choose just one event at a time
EventofInterest <- df[i,]
# Get site and facility information for that event
facility_of_interest <- EventofInterest$facility %>%
unlist()
site_of_interest <- EventofInterest$site %>%
unlist()
# Count up recent successes
recent_success <- df %>%
filter(outcome == 1,
facility %in% facility_of_interest,
site %in% c((site-1),site,(site 1)),
date %within% lubridate::interval(date-7,date)) %>%
nrow()
# Count up recent failures
recent_failures <- df %>%
filter(outcome == 0,
facility %in% facility_of_interest,
site %in% c((site-1),site,(site 1)),
date %within% lubridate::interval(date-7,date)) %>%
nrow()
# Create an output dataframe with the tallied successes and failures
outputdf <- EventofInterest %>%
mutate(recent_success = recent_success,
recent_failures = recent_failures
) %>%
# Bind that to the existing output dataframe
bind_rows(outputdf)
}
And it even gives me the output I'm looking for:
> head(outputdf)
facility site date outcome recent_success recent_failures
1 C 4 1990-01-23 1 15 23
2 B 1 1990-02-18 1 16 19
3 B 1 1990-02-01 1 16 19
4 A 5 1990-01-06 1 10 17
5 B 5 1990-01-10 0 16 19
6 C 3 1990-02-26 1 15 23
But it gets unreasonably slow as my input dataframe gets larger (and more complex). The input data is about 150 mb.
So, my question is how do I speed this process up? It seems like a good fit for something like dplyr::summarize()
, or even pulling in some more processors (although I worry about RAM utilization). A for
loop is almost certainly one of the slower ways to go about this.
I've already tried some things based on other posts, like making sure as little of the calculation is done within the for
loop as possible. I have created date intervals in the input dataframe before starting, but that just seems to make the input bigger. I've also tried splitting out my successes and failures, but that just seems to hog more memory without speeding things up (evidently my choke point is not in comparing two numbers).
Any input would be greatly appreciated!
CodePudding user response:
I'm not quite sure this fully understands your question, but it sounds like you want to count based on a non-equi join, where facility matches exactly, site is within /- 1 and date is within the past week.
data.table
and sqldf
handle non-equi joins, and dplyr does not. Nonetheless, I think we can get a pretty performant vectorized dplyr answer by making copies of the data (to simulate adjacent sites) and using slider::slide_index_dbl
for the sliding time window.
df %>%
#make copies shifted site /- 1
uncount(3, .id = "version") %>%
mutate(site_shifted = site (version-2)) %>%
arrange(date) %>%
group_by(facility, site_shifted) %>%
mutate(recent_success = slider::slide_index_dbl(
outcome, date, ~sum(.x == 1), .before = lubridate::days(6)),
recent_failure = slider::slide_index_dbl(
outcome, date, ~sum(.x == 0), .before = lubridate::days(6))) %>%
ungroup()