I have pollution measures by day and location. I have a population of people for which I want to measure pollution exposure. Each person has a location and a period of time during which they were in that location.
For each person in my dataset, I need to sum up the pollution values from their location over their time period, and also count the number of missing pollution measurements.
The table structures are the following:
person_df:

   ids start_dates  end_dates  zips
1    1  2000-10-10 2001-02-18 45108
2    2  2000-11-11 2001-04-07 45190
3    3  2000-03-05 2000-06-27 45117
4    4  2001-02-04 2001-06-09 45142
5    5  2000-03-16 2000-07-13 45197
6    6  1999-12-15 2000-04-27 45060

exposure_df:

  exposure_day exposure_zip exposure_value
1   1999-06-26        45108             14
2   1999-06-27        45108             27
3   1999-06-28        45108             22
4   1999-06-29        45108              4
5   1999-06-30        45108             26
6   1999-07-01        45108             20
Desired output:
ids start_dates end_dates zips exposure_sum na_count
1: 1 2000-10-10 2001-02-18 45108 3188 5
2: 2 2000-11-11 2001-04-07 45190 3789 1
3: 3 2000-03-05 2000-06-27 45117 2917 3
4: 4 2001-02-04 2001-06-09 45142 2969 2
5: 5 2000-03-16 2000-07-13 45197 2860 3
6: 6 1999-12-15 2000-04-27 45060 3497 2
My current solution is quite slow. I would like a more efficient solution so that I can perform this calculation for roughly 1,000,000 people.
Below is the code to simulate my data and my current solution.
set.seed(123)
library(lubridate)
library(data.table)
# Make person dataframe
n = 1000 # sample size
ids = c(1:n)
end_dates = sample(seq(as.Date('2000-01-01'), as.Date('2002-01-01'), by="day"), n, replace = T)
time_intervals = sample(seq(100, 200), n, replace = T)
start_dates = end_dates - time_intervals
zips = sample(seq(45000, 45200), n, replace = T)
person_df = data.frame(ids, start_dates, end_dates, zips)
# Make exposure dataframe
ziplist = unique(zips)
nzips = length(ziplist)
ndays = as.numeric(as.Date(max(person_df$end_dates)) - as.Date(min(person_df$start_dates)) + 1)
exposure_dates = seq(as.Date(min(person_df$start_dates)), as.Date(max(person_df$end_dates)), by = 'day')
exposure_day = rep(exposure_dates, nzips)
exposure_zip = rep(ziplist, each = ndays)
exposure_value = sample(c(NA, 1:50), length(exposure_day), replace = T)
exposure_df = data.frame(exposure_day, exposure_zip, exposure_value)
# convert to datatable
person_dt = data.table(person_df)
exposure_dt = data.table(exposure_df)
#summarize
summary_dt = person_dt[, ":="(exposure_sum = .(sum(exposure_dt[exposure_day>=start_dates & exposure_day<=end_dates & exposure_zip == zips, exposure_value], na.rm = T)),
na_count = .(sum(is.na(exposure_dt[exposure_day>=start_dates & exposure_day<=end_dates & exposure_zip == zips, exposure_value])))),
by = 'ids'][]
CodePudding user response:
This dplyr approach is about 40x faster at n=1000, 50x at n=10k, and 60x faster at n=100k, with the same output. The main gain is from translating the non-equi join into a left join by expanding person_df to have one row for each exposure_day in each ids range. That upfront step of expanding all the dates can be done once to make the subsequent joins dramatically faster.

It takes about 2 minutes when I run this for n=1,000,000, which I presume would take around 2 hours using the original code. I imagine further improvements could be made by porting to data.table or collapse if that isn't fast enough.
library(dplyr)

person_df %>%
  # expand each person to one row per day in their exposure window
  group_by(ids, exposure_zip = zips) %>%
  summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
  # ordinary equi-join on exposure_zip + exposure_day
  left_join(exposure_df) %>%
  group_by(ids) %>%
  summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
            na_count = sum(is.na(exposure_value))) %>%
  # optional: join the start_dates, end_dates, zips columns back on
  left_join(person_df)
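Since the expensive part is the expansion, it can also be stored once and reused if you have several exposure measures to join; here is a small sketch of that variant (person_days is just my name for the intermediate table, not something from the code above):

# Build the person-day table once ...
person_days <- person_df %>%
  group_by(ids, exposure_zip = zips) %>%
  summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop")

# ... then reuse it for this (or any other) daily exposure table
person_days %>%
  left_join(exposure_df, by = c("exposure_zip", "exposure_day")) %>%
  group_by(ids) %>%
  summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
            na_count = sum(is.na(exposure_value)))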
Update: porting to data.table using dtplyr improved the 1M row test slightly, to 100 seconds on my machine.
library(dtplyr)
person_df %>%
lazy_dt() %>%
group_by(ids, exposure_zip = zips) %>%
summarize(exposure_day = seq.Date(start_dates, end_dates, by = "day"), .groups = "drop") %>%
left_join(exposure_df) %>%
group_by(ids) %>%
summarize(exposure_sum = sum(exposure_value, na.rm = TRUE),
na_count = sum(is.na(exposure_value)), .groups = "drop") %>%
left_join(person_df) %>%
collect()
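For completeness, the same expand-then-join idea written directly in data.table looks roughly like this (a sketch only; I have not benchmarked it, and the intermediate object names are mine):

# Expand each person to one row per day in their window
expanded = person_dt[, .(exposure_day = seq(start_dates, end_dates, by = "day")),
                     by = .(ids, exposure_zip = zips)]
# Ordinary equi-join onto the exposure table
joined = exposure_dt[expanded, on = .(exposure_zip, exposure_day)]
# Summarise per person and add the original person columns back
sums = joined[, .(exposure_sum = sum(exposure_value, na.rm = TRUE),
                  na_count     = sum(is.na(exposure_value))), by = ids]
result = sums[person_dt, on = "ids"]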
CodePudding user response:
Here is another approach you could consider. Use cumulative sums across the entire range of days, and then, for each id, subtract the cumulative sum at the start date from the cumulative sum at the end date. This is quite fast:
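In other words, the sum over a window of days [start, end] is the running total at end minus the running total just before start, and the same identity works for counting NAs. A toy illustration on a made-up vector:

# Toy illustration of the cumulative-sum idea (values are made up)
x    <- c(3, NA, 5, 2, NA, 7)
cval <- cumsum(ifelse(is.na(x), 0, x))  # running sum, NAs treated as 0
nas  <- cumsum(is.na(x))                # running count of NAs
start <- 3; end <- 6
sum(x[start:end], na.rm = TRUE)         # 14, computed directly
cval[end] - cval[start - 1]             # 14, from the running totals
sum(is.na(x[start:end]))                # 1, computed directly
nas[end] - nas[start - 1]               # 1, from the running counts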
Step 1: Within each zip, create the cumulative sum of exposure values (cval) and the cumulative count of NA values (nas):
# Running totals within each zip, in day order
exposure_dt[order(exposure_day), `:=`(
  cval = cumsum(fifelse(is.na(exposure_value), 0, exposure_value)),  # sum, treating NA as 0
  nas  = cumsum(is.na(exposure_value))                               # count of NAs so far
), exposure_zip]
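To see what this adds, you can peek at the first few rows for a single zip (45108 is just one of the zips appearing in the simulated data above):

# First few rows of running totals for one zip from the simulated data
exposure_dt[exposure_zip == 45108, .(exposure_day, exposure_value, cval, nas)][1:5]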
Step 2: Simply melt the person_dt frame and do a direct merge on the exposure_dt frame. Make sure to subtract the exposure value from the cumulative sum if this is a start day and the exposure value is not NA; similarly, subtract one from nas if this is a start day and the exposure value is NA.
library(magrittr)  # for the %>% pipe used below

k <- melt(person_dt, id.vars = c("ids", "zips")) %>%
  # join each start/end date row to its zip's running totals on that day
  .[exposure_dt, on = .(zips == exposure_zip, value == exposure_day), nomatch = 0] %>%
  # adjust start-day rows so the start day itself is counted in the window
  .[variable == "start_dates" & is.na(exposure_value), nas := nas - 1] %>%
  .[variable == "start_dates" & !is.na(exposure_value), cval := cval - exposure_value] %>%
  .[order(ids, value)]
Step 3: Simply subtract the odd rows (the start-date rows) from the even rows (the end-date rows), and cbind the result to person_dt:
cbind(
person_dt,
k[seq(2,.N,2),.(cval,nas)] - k[seq(1,.N,2),.(cval,nas)]
)
All of this takes 0.08 seconds on my machine with the original 1000 ids dataset.
Output:
ids start_dates end_dates zips cval nas
<int> <Date> <Date> <int> <num> <int>
1: 1 2000-10-10 2001-02-18 45108 3188 5
2: 2 2000-11-11 2001-04-07 45190 3789 1
3: 3 2000-03-05 2000-06-27 45117 2917 3
4: 4 2001-02-04 2001-06-09 45142 2969 2
5: 5 2000-03-16 2000-07-13 45197 2860 3
---
996: 996 2000-02-21 2000-07-29 45139 4250 2
997: 997 2000-02-02 2000-07-15 45074 4407 4
998: 998 2001-07-29 2001-11-15 45139 2686 3
999: 999 2001-09-10 2001-12-20 45127 2581 1
1000: 1000 2000-10-15 2001-05-01 45010 4941 2
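If you want the result columns named as in the desired output rather than cval/nas, a setnames() call on the combined table does it (summary_dt2 is just an arbitrary name for the result of the cbind above):

# Rename the cumulative-difference columns to match the desired output
summary_dt2 <- cbind(
  person_dt,
  k[seq(2, .N, 2), .(cval, nas)] - k[seq(1, .N, 2), .(cval, nas)]
)
setnames(summary_dt2, c("cval", "nas"), c("exposure_sum", "na_count"))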