R nested for-loops takes 30 days to run-CodePudding

I want to count the number of patients waiting to be seen by a doctor each 15 min for a 3.5 years time frame. I have a first data.frame (dates) which has 122880 rows of dates (each 15 min). I have another data.frame (episode) which has 225000 rows of patient ID with the time when they came to the ER, the time when they left and their ID. I have a third data.frame (care) with the time the doctor saw the patients and the patients' ID.

Here is my code:

for(hour in 1:122880){
        for(patient in 1:nrow(episode){
                if(episode$begin[patient] <dates[hour]&episode$end[patient]>dates[hour]){
                        no_episode = episode$id[patient]
                        if(care$begin[care$id==no_episode]>dates[hour]{
                                nb_wait = nb_wait   1
                                delay = delay   dates[hour]-episode$begin[patient]
                        }
                }
        }
        nb_wait_total = rbind(nb_wait_total, nb_wait)
        delay_total = rbind(delay_total, delay)
        nb_wait = 0
        delay = 0
}

The first loop is to set the date and write the results. The second loop first if statement is to search in the episode data.frame which patients are in the ER during the time frame. The second if is to count the patient that haven't seen the doctor yet. I also sum how much time the patients have been waiting.

It is extremely long (estimated 30 days) and by the tests I have done, the longest line is this one:

if(episode$begin[patient]<dates[hour}&episode$end[patient}>dates[hour)

Any idea how to clean my code so it doesn't take so much time? I have tried cutting the episode data.frame into 4, which speeds up the process but would still take 7 days!

Thank you!

CodePudding user response：

Update! Went from 30 days to 11 hours, thanks to the comments I had on the post!

Here is the code after modification:

for(hour in 1:122880){
        temp <- episode$id[episode$begin<dates[hour]&episode$end>dates[hour]]
        for(element in 1:length(temp){
                no_episode = temp[element]
                if(care$begin[care$id==no_episode]>dates[hour]{
                        nb_wait = nb_wait   1
                        delay = delay   dates[hour]-episode$begin[episode$id==no_episode]
                }
        }
        nb_wait_total = rbind(nb_wait_total, nb_wait)
        delay_total = rbind(delay_total, delay)
        nb_wait = 0
        delay = 0
}

Getting rid of the first if statement and the 2nd (longest) loop did the trick!

CodePudding user response：

Not sure if this does exactly what you need, but it uses data of similar size and calculates the wait counts at each 15 minute interval in the date range in under 1 second.

First, here's some fake data. I assume here that "id" "date" are unique identifiers in both the "care" and "episode" tables, and there's a 1:1 match for each. I also assume patient always arrives before doctor arrival, and ends after doctor arrival.

library(tidyverse); library(lubridate)
set.seed(42)
epi_n <- 500000 # before filtering; results in 230k data

care <- tibble(
  id = sample(1:99999, epi_n, replace = TRUE),
  begin = ymd_h(2010010100)   runif(epi_n, max = 3.5*365*24*60*60),
  date = as_date(begin)) %>%
  filter(hour(begin) >= 7, hour(begin) <= 17) %>%
  distinct(id, date, .keep_all = TRUE) # make "id" and "date" unique identifiers

episode <- care %>%
  transmute(id, date, 
            begin = begin - rpois(nrow(care), 10*60),
            end   = begin   rgamma(nrow(care), 1)*600) %>%
  arrange(begin)

Matching the data between the two data sets takes 0.2 sec.

tictoc::tic()
combined <- left_join(
  episode %>% rename("patient_arrival" = "begin"),
  care %>% rename("dr_arrival" = "begin")
)
tictoc::toc()

Counting how many patients had arrived but doctors hadn't at each 15 minute interval takes another 0.2 sec.

Here, I isolate each patient arrival and each doctor arrival; the first moment adds one to the wait at that moment, and the second reduces the wait count by one. We can sort those and count the cumulative number of waiting patients. Finally, I collate in a table like your "dates" table, and fill the wait counts that were in effect prior to each 15 minute interval. Then I just show those.

tictoc::tic()
combined %>%
  select(id, patient_arrival, dr_arrival) %>%
  pivot_longer(-id, values_to = "hour") %>%
  mutate(wait_chg = if_else(name == "patient_arrival", 1, -1)) %>%
  arrange(hour) %>%
  mutate(wait_count = cumsum(wait_chg)) %>%
  bind_rows(.,
            tibble(name = "15 minute interval",
                   hour = seq.POSIXt(ymd_h(2010010100),
                                     to = ymd_h(2013070112), 
                                     by = "15 min")) %>%
              filter(hour(hour) >= 6, hour(hour) < 20)
  ) %>%
  arrange(hour) %>% 
  fill(wait_count) %>%
  replace_na(list(wait_count = 0)) %>%
  filter(name == "15 minute interval") %>%
  select(hour,wait_count)
tictoc::toc()