Efficiently count and classify events occurring within a specified interval-CodePudding

I am trying to analyze a large data set (>1.5M observations) to look for correlations between the timing and location of events. Unfortunately, I am encountering performance issues in creating my analysis data set.

Essentially, an event can occur at one of several facilities (A, B, and C), several sites at each facility (1-6), and at a range of dates (between January and June 1, 1990), and have a result value of 1 and 0. Sites are near one another, so an event at site 2 is likely to affect sites 1, 2, and 3. Events have a persistent effect as well, so an event on March 10 might affect future events through March 17 or so.

Here is some example data:

set.seed(12345)

df <- data.frame(
  facility = sample(
    c("A","B","C"),
    100,
    replace=TRUE),
  
  site = sample(
    1:6,
    100,
    replace=TRUE),
  
  date = as.Date(
    sample(
      c(lubridate::ymd("1990-1-1"):lubridate::ymd("1990-6-1")),
      100,
      replace=TRUE),
    origin = "1970-01-01"
    ),
    
  outcome = sample(
    c(0,1),100,
    replace=TRUE),
  
  stringsAsFactors = FALSE
)

So far I have managed to get something that works through each iteration:

# A place to put the output
outputdf <- data.frame(
  facility = character(),
  site = numeric(),
  date = as.Date(character()),
  outcome = numeric(),
  recent_success = integer(),
  recent_failures = integer(),
  stringsAsFactors = FALSE
)

# Loop through each iteration
for(i in 1:nrow(df)){
  # Let me know how things are going in the console
  print(paste("Event ",i," of ",nrow(df),sep=""))
  
  #Choose just one event at a time
  EventofInterest <- df[i,]

  # Get site and facility information for that event
  facility_of_interest <- EventofInterest$facility %>%
    unlist()
  
  site_of_interest <- EventofInterest$site %>%
    unlist()
  
 # Count up recent successes
  recent_success <- df %>%
    filter(outcome == 1,
           facility %in% facility_of_interest,
           site %in% c((site-1),site,(site 1)),
           date %within% lubridate::interval(date-7,date)) %>% 
    nrow()
  
  # Count up recent failures
recent_failures <- df %>%
  filter(outcome == 0,
         facility %in% facility_of_interest,
         site %in% c((site-1),site,(site 1)),
         date %within% lubridate::interval(date-7,date)) %>% 
  nrow()
  
  # Create an output dataframe with the tallied successes and failures
  outputdf <- EventofInterest %>%
    mutate(recent_success = recent_success,
           recent_failures = recent_failures
    ) %>%
    # Bind that to the existing output dataframe
    bind_rows(outputdf)
  
  
}

And it even gives me the output I'm looking for:

> head(outputdf)
  facility site       date outcome recent_success recent_failures
1        C    4 1990-01-23       1             15              23
2        B    1 1990-02-18       1             16              19
3        B    1 1990-02-01       1             16              19
4        A    5 1990-01-06       1             10              17
5        B    5 1990-01-10       0             16              19
6        C    3 1990-02-26       1             15              23

But it gets unreasonably slow as my input dataframe gets larger (and more complex). The input data is about 150 mb.

So, my question is how do I speed this process up? It seems like a good fit for something like dplyr::summarize(), or even pulling in some more processors (although I worry about RAM utilization). A for loop is almost certainly one of the slower ways to go about this.

I've already tried some things based on other posts, like making sure as little of the calculation is done within the for loop as possible. I have created date intervals in the input dataframe before starting, but that just seems to make the input bigger. I've also tried splitting out my successes and failures, but that just seems to hog more memory without speeding things up (evidently my choke point is not in comparing two numbers).

Any input would be greatly appreciated!

CodePudding user response：

I'm not quite sure this fully understands your question, but it sounds like you want to count based on a non-equi join, where facility matches exactly, site is within /- 1 and date is within the past week.

data.table and sqldf handle non-equi joins, and dplyr does not. Nonetheless, I think we can get a pretty performant vectorized dplyr answer by making copies of the data (to simulate adjacent sites) and using slider::slide_index_dbl for the sliding time window.

df %>%
  #make copies shifted site  /- 1
  uncount(3, .id = "version") %>%
  mutate(site_shifted = site   (version-2)) %>%
  
  arrange(date) %>%
  group_by(facility, site_shifted) %>%
  mutate(recent_success = slider::slide_index_dbl(
           outcome, date, ~sum(.x == 1), .before = lubridate::days(6)),
         recent_failure = slider::slide_index_dbl(
           outcome, date, ~sum(.x == 0), .before = lubridate::days(6))) %>%
  ungroup()