Efficiently summarize data over many different time periods by additional criteria (person and location)


I have pollution measures by day and location. I have a population of people for which I want to measure pollution exposure. Each person has a location and a period of time during which they were in that location.

For each person in my dataset, I need to sum up the pollution values from their location over their time period, and also count the number of missing pollution measurements.

The table structures are the following.

person_df:

  ids start_dates  end_dates  zips
1   1  2000-10-10 2001-02-18 45108
2   2  2000-11-11 2001-04-07 45190
3   3  2000-03-05 2000-06-27 45117
4   4  2001-02-04 2001-06-09 45142
5   5  2000-03-16 2000-07-13 45197
6   6  1999-12-15 2000-04-27 45060

exposure_df:

  exposure_day exposure_zip exposure_value
1   1999-06-26        45108             14
2   1999-06-27        45108             27
3   1999-06-28        45108             22
4   1999-06-29        45108              4
5   1999-06-30        45108             26
6   1999-07-01        45108             20

Desired output:

   ids start_dates  end_dates  zips exposure_sum na_count
1:   1  2000-10-10 2001-02-18 45108         3188        5
2:   2  2000-11-11 2001-04-07 45190         3789        1
3:   3  2000-03-05 2000-06-27 45117         2917        3
4:   4  2001-02-04 2001-06-09 45142         2969        2
5:   5  2000-03-16 2000-07-13 45197         2860        3
6:   6  1999-12-15 2000-04-27 45060         3497        2

My current solution is quite slow. I would like to find a more efficient solution so that I can perform this calculation for roughly 1,000,000 people.

Below is the code to simulate my data and my current solution.

set.seed(123)
library(lubridate)
library(data.table)

# Make person dataframe
n = 1000  # sample size
ids = c(1:n)
end_dates = sample(seq(as.Date('2000-01-01'), as.Date('2002-01-01'), by="day"), n, replace = T)
time_intervals = sample(seq(100, 200), n, replace = T)
start_dates = end_dates - time_intervals
zips = sample(seq(45000, 45200), n, replace = T)

person_df = data.frame(ids, start_dates, end_dates,  zips)

# Make exposure dataframe
ziplist = unique(zips)
nzips = length(ziplist)
ndays = as.numeric(as.Date(max(person_df$end_dates)) - as.Date(min(person_df$start_dates)) + 1)
exposure_dates = seq(as.Date(min(person_df$start_dates)), as.Date(max(person_df$end_dates)), by = 'day')
exposure_day = rep(exposure_dates, nzips)
exposure_zip = rep(ziplist, each = ndays)
exposure_value = sample(c(NA, 1:50), length(exposure_day), replace = T)

exposure_df = data.frame(exposure_day, exposure_zip, exposure_value)

# convert to datatable
person_dt = data.table(person_df)
exposure_dt = data.table(exposure_df) 

#summarize
summary_dt = person_dt[, ":="(exposure_sum = .(sum(exposure_dt[exposure_day>=start_dates & exposure_day<=end_dates & exposure_zip == zips, exposure_value], na.rm = T)),
                                na_count =    .(sum(is.na(exposure_dt[exposure_day>=start_dates & exposure_day<=end_dates & exposure_zip == zips, exposure_value])))),
                        by = 'ids'][]

CodePudding user response:

This dplyr approach is about 40x faster at n=1000, 50x at n=10k, and 60x faster at n=100k, with the same output. The main gain is from translating the non-equi join into a left join by expanding person_df to have one row for each exposure_day in each ids range. That upfront step of expanding all the dates can be done once to make the subsequent joins dramatically faster.

It takes about 2 minutes when I run this for n=1,000,000, which I presume would take around 2 hours using the original code. I imagine further improvements could be made by porting to data.table or collapse if that isn't fast enough.

library(dplyr)  # the pipeline below needs dplyr loaded

person_df %>% 
  # expand each person to one row per exposure_day in their date range
  group_by(ids, exposure_zip = zips) %>%
  summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
  # joins on the shared exposure_zip and exposure_day columns
  left_join(exposure_df) %>%
  group_by(ids) %>%
  summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
            na_count = sum(is.na(exposure_value))) %>%
  # optional: join start_dates, end_dates, zips columns back on
  left_join(person_df)

Update: porting to data.table using dtplyr improved the 1M row test slightly, to 100 seconds on my machine.

library(dtplyr)
person_df %>%
  lazy_dt() %>%
  group_by(ids, exposure_zip = zips) %>%
  summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
  left_join(exposure_df) %>%
  group_by(ids) %>%
  summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
            na_count = sum(is.na(exposure_value)), .groups = "drop") %>%
  left_join(person_df) %>%
  collect()
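
A minimal plain-data.table sketch of the same expand-then-join idea, assuming the person_dt and exposure_dt objects from the question (person_days is just an illustrative intermediate name), would look something like this:

# expand each id to one row per exposure_day in its date range
person_days = person_dt[, .(exposure_zip = zips,
                            exposure_day = seq(start_dates, end_dates, by = "day")),
                        by = ids]

# join the exposure values on and summarize per id
summary_dt = exposure_dt[person_days, on = .(exposure_zip, exposure_day)][
  , .(exposure_sum = sum(exposure_value, na.rm = TRUE),
      na_count     = sum(is.na(exposure_value))),
  by = ids]

# optional: attach start_dates, end_dates, zips back on
summary_dt = summary_dt[person_dt, on = "ids"]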

CodePudding user response:

Here is another approach you could consider. Compute cumulative sums of the exposure values across the entire range of days within each zip, and then, for each id, subtract the cumulative sum at its start date from the cumulative sum at its end date (with a small adjustment in Step 2 so that the start day itself is included). This is quite fast:
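
As a toy illustration of the running-total arithmetic (made-up numbers for a single zip, not the question's data):

v  <- c(3, 1, 4, 1, 5, 9)   # daily values for one zip, in date order
cv <- cumsum(v)             # running totals: 3 4 8 9 14 23
cv[5] - cv[2]               # sum over days 3..5 = 14 - 4 = 10, i.e. 4 + 1 + 5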

Step 1: Within each zip, ordered by day, create the cumulative sum of exposure values (cval) and the cumulative count of NA values (nas):

exposure_dt[order(exposure_day), `:=`(
  cval = cumsum(fifelse(is.na(exposure_value), 0, exposure_value)),
  nas  = cumsum(is.na(exposure_value))
), by = exposure_zip]

Step 2: Melt the person_dt frame and do a direct merge with the exposure_dt frame. Make sure to subtract the exposure value from the cumulative sum if this is a start day and the exposure value is not NA; similarly, subtract one from nas if this is a start day and the exposure value is NA. That way each start row holds the totals up to the day just before the window, so the end-minus-start subtraction includes the start day itself.

library(magrittr)  # for the %>% pipe used to chain the data.table calls

k <- melt(person_dt, id.vars = c("ids","zips")) %>% 
  .[exposure_dt, on=.(zips==exposure_zip, value==exposure_day), nomatch=0] %>% 
  .[variable=="start_dates" & is.na(exposure_value), nas:=nas-1] %>% 
  .[variable=="start_dates" & !is.na(exposure_value), cval:=cval-exposure_value] %>% 
  .[order(ids,value)]

Step 3: Subtract the odd rows from the even rows (after the ordering in Step 2, each id contributes exactly two rows: its start date first, then its end date), and cbind the result to person_dt:

cbind(
  person_dt,
  k[seq(2,.N,2),.(cval,nas)] - k[seq(1,.N,2),.(cval,nas)]
)

All of this takes 0.08 seconds on my machine with the original 1000 ids dataset.

Output:

        ids start_dates  end_dates  zips  cval   nas
      <int>      <Date>     <Date> <int> <num> <int>
   1:     1  2000-10-10 2001-02-18 45108  3188     5
   2:     2  2000-11-11 2001-04-07 45190  3789     1
   3:     3  2000-03-05 2000-06-27 45117  2917     3
   4:     4  2001-02-04 2001-06-09 45142  2969     2
   5:     5  2000-03-16 2000-07-13 45197  2860     3
  ---                                               
 996:   996  2000-02-21 2000-07-29 45139  4250     2
 997:   997  2000-02-02 2000-07-15 45074  4407     4
 998:   998  2001-07-29 2001-11-15 45139  2686     3
 999:   999  2001-09-10 2001-12-20 45127  2581     1
1000:  1000  2000-10-15 2001-05-01 45010  4941     2
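
If you want the columns named exposure_sum and na_count as in the desired output, you could rename them afterwards with setnames (result here is just a name for the cbind output above):

result <- cbind(
  person_dt,
  k[seq(2,.N,2),.(cval,nas)] - k[seq(1,.N,2),.(cval,nas)]
)
setnames(result, c("cval","nas"), c("exposure_sum","na_count"))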