I have a dataset of hospital visits and patient IDs. For each patient, I want to add a number of new variables, for instance the number of prior visits and the number of prior visits within a year of the current visit. A simplified version of the data looks like this.
id visit_date
122 2009-01-01
122 2009-10-23
122 2009-12-18
324 2015-01-07
324 2016-03-04
454 2010-05-03
3332 2012-07-03
3332 2013-02-09
Right now I'm grouping the data by id, arranging by descending date, and using a loop that checks every previous row for each row within a group (a simplified sketch of this approach follows the table below). There are a few million rows, so this ends up taking days, especially when there are multiple conditions to check. Is there a more efficient way to do this? The desired output would be
id visit_date prior_visits prior_one_year_visits
122 2009-01-01 0 0
122 2009-10-23 1 1
122 2009-12-18 2 2
324 2015-01-07 0 0
324 2016-03-04 1 0
454 2010-05-03 0 0
3332 2012-07-03 0 0
3332 2013-02-09 1 1
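My current approach looks roughly like this simplified sketch (with 365 days standing in for "one year"):
# For every row, compare against every other row to find same-id earlier
# visits. This is quadratic in the number of rows, which is why it takes
# days on millions of rows.
dat$prior_visits <- NA_integer_
dat$prior_one_year_visits <- NA_integer_
for (r in seq_len(nrow(dat))) {
  same_id <- dat$id == dat$id[r]
  earlier <- dat$visit_date < dat$visit_date[r]
  dat$prior_visits[r] <- sum(same_id & earlier)
  dat$prior_one_year_visits[r] <-
    sum(same_id & earlier & dat$visit_date >= dat$visit_date[r] - 365)
}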
CodePudding user response:
Up front: the most performant solutions here appear to be, in order, sqldf, data.table, and base R, while fuzzyjoin works with small data but not with larger data. The rest of the answer is kept in case smaller data (or package constraints) make the more readable or more comfortable code the better choice.
This can be done in base R rather efficiently. Up front: I'm assuming class(dat$visit_date) is Date.
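If visit_date is still stored as character, it can be converted first; a one-line reminder (nothing more) is:
# only needed if visit_date is not yet a Date; everything below assumes it is
dat$visit_date <- as.Date(dat$visit_date)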
library(lubridate) # for %m-% and years()

# prior_visits: for each visit, count same-id visits with a strictly earlier date
dat$prior_visits <- ave(
  as.integer(dat$visit_date), dat$id,
  FUN = function(z) rowSums(as.matrix(sapply(z, `<`, z))))

# prior_oneyear: the same, but restricted to the year before the current visit
dat$prior_oneyear <- as.integer(ave(
  dat$visit_date, dat$id,
  FUN = function(z) as.Date(
    rowSums(as.matrix(sapply(z, `<=`, z)) &
            as.matrix(sapply(z, `>=`, z %m-% years(1)))) - 1,
    origin = "1970-01-01")))
dat
dat
# id visit_date prior_visits prior_oneyear
# 1 122 2009-01-01 0 0
# 2 122 2009-10-23 1 1
# 3 122 2009-12-18 2 2
# 4 324 2015-01-07 0 0
# 5 324 2016-03-04 1 0
# 6 454 2010-05-03 0 0
# 7 3332 2012-07-03 0 0
# 8 3332 2013-02-09 1 1
Notes:
- The as.matrix is there to force sapply to return a matrix even when the data is only length-1 (for which sapply typically returns just a vector): rowSums requires a matrix, so we ensure it gets one. If the length is greater than 1, the call to as.matrix is a no-op.
- ave requires that its return value have the same class as its first argument. Because of this, we wrap the first call with as.integer(dat$visit_date) so that the return value will be ints; this is fine for basic <-logic. Unfortunately, in order to do "proper" one-year subtraction, we can't convert to a number without losing the leap-year possibility; because of this, we keep it as a Date, do the in-function year subtraction with lubridate's %m-%, force the resulting count to a Date with origin=, and then convert back to an integer outside ave. It seems roundabout, but it gives the correct count even in the presence of leap years.
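A tiny demonstration of both points (the objects here are throwaway examples, not part of the solution):
# With a length-1 group, sapply() returns a plain vector, which rowSums()
# would reject; as.matrix() turns it back into a 1x1 matrix.
z <- as.Date("2009-01-01")
sapply(z, `<`, z)              # logical vector of length 1
as.matrix(sapply(z, `<`, z))   # 1x1 matrix, safe for rowSums()

# lubridate's %m-% does month-aware subtraction, so leap days are handled:
library(lubridate)
as.Date("2016-02-29") %m-% years(1)   # "2015-02-28"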
This can also be solved with a range-based join, which is supported by sqldf, data.table, and fuzzyjoin, but not natively/directly by dplyr (though workarounds similar to the base R ave implementation above are certainly possible; see the sketch just below). All of the code produces the same output as above, albeit sometimes in a different data.frame dialect.
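For completeness, one such dplyr workaround (plain per-group pairwise comparisons in the spirit of the ave() version, not a true range join) might look like this sketch:
library(dplyr)
library(lubridate)

# within each id, compare every visit against the group's other visits
dat %>%
  group_by(id) %>%
  mutate(
    prior_visits  = sapply(seq_along(visit_date), function(i)
      sum(visit_date < visit_date[i])),
    prior_oneyear = sapply(seq_along(visit_date), function(i)
      sum(visit_date < visit_date[i] &
          visit_date >= visit_date[i] %m-% years(1)))
  ) %>%
  ungroup()
It avoids the join packages entirely, but like the ave() version it does pairwise comparisons within each group, so it relies on individual groups being small.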
sqldf
sqldf::sqldf("
with cte as (
select t1.id, t1.visit_date, count(t3.visit_date) as prior_oneyear
from dat t1
left join dat t3 on t1.id = t3.id
and t3.visit_date between (t1.visit_date-365) and (t1.visit_date-1)
group by t1.id, t1.visit_date
)
select t1.id, t1.visit_date, count(t2.visit_date) as prior_visits,
cte.prior_oneyear
from dat t1
left join dat t2 on t1.id = t2.id and t1.visit_date > t2.visit_date
left join cte on t1.id = cte.id and t1.visit_date = cte.visit_date
group by t1.id, t1.visit_date")
data.table
library(data.table)
library(lubridate) # %m-%, years
DT <- as.data.table(dat)
DT[, oneyear := visit_date %m-% years(1)
][, prior_visits := DT[DT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N
][, prior_oneyear := DT[DT, .N, on = .(id, visit_date >= oneyear, visit_date < visit_date), by = .EACHI]$N
][, oneyear := NULL][]
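The building block in that chain is a non-equi self-join; it may be easier to read pulled out on its own:
# For each row of DT supplied as i, count the rows of DT (as x) that have the
# same id and a strictly earlier visit_date. With by = .EACHI the result has
# one row per row of DT, in DT's row order, which is why $N can be assigned
# straight into prior_visits above.
DT[DT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N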
fuzzyjoin
library(dplyr)
library(lubridate) # for %m-% below
dat %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "visit_date" = "visit_date"), match_fun = list(`==`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = sum(visit_date.y < visit_date.x), .groups = "drop") %>%
mutate(oneyear = visit_date %m-% years(1)) %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "oneyear" = "visit_date", "visit_date" = "visit_date"), match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = prior_visits.x[1], prior_oneyear = sum(visit_date.y < visit_date.x)) %>%
ungroup()
Benchmark
While benchmarks on such small data are not a very good measure, they give a hint of the relative efficiency here.
bench::mark(
base = {
dat$prior_visits <- ave(
as.integer(dat$visit_date), dat$id,
FUN = function(z) rowSums(as.matrix(sapply(z, `<`, z))));
dat$prior_oneyear <- as.integer(ave(
dat$visit_date, dat$id,
FUN = function(z) as.Date(rowSums(
as.matrix(sapply(z, `<=`, z)) & as.matrix(sapply(z, `>=`, z %m-% years(1)))) - 1,
origin = "1970-01-01") ));
},
sqldf = sqldf::sqldf("
with cte as (
select t1.id, t1.visit_date, count(t3.visit_date) as prior_oneyear
from dat t1
left join dat t3 on t1.id = t3.id
and t3.visit_date between (t1.visit_date-365) and (t1.visit_date-1)
group by t1.id, t1.visit_date
)
select t1.id, t1.visit_date, count(t2.visit_date) as prior_visits,
cte.prior_oneyear
from dat t1
left join dat t2 on t1.id = t2.id and t1.visit_date > t2.visit_date
left join cte on t1.id = cte.id and t1.visit_date = cte.visit_date
group by t1.id, t1.visit_date"),
data.table = DT[, oneyear := visit_date %m-% years(1)
][, prior_visits := DT[DT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N
][, prior_oneyear := DT[DT, .N, on = .(id, visit_date >= oneyear, visit_date < visit_date), by = .EACHI]$N
][, oneyear := NULL][],
fuzzyjoin = dat %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "visit_date" = "visit_date"), match_fun = list(`==`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = sum(visit_date.y < visit_date.x), .groups = "drop") %>%
mutate(oneyear = visit_date %m-% years(1)) %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "oneyear" = "visit_date", "visit_date" = "visit_date"), match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = prior_visits.x[1], prior_oneyear = sum(visit_date.y < visit_date.x)) %>%
ungroup(),
check = FALSE, min_iterations = 100)
Results:
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 base 4.19ms 4.86ms 196. 59KB 3.99 98 2 501.15ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
2 sqldf 9.96ms 11.79ms 81.3 132KB 2.51 97 3 1.19s <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
3 data.table 5.22ms 7.91ms 118. 371KB 2.40 98 2 833.84ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
4 fuzzyjoin 161.5ms 185.19ms 5.32 320KB 5.32 50 50 9.39s <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
For a more realistic test, I'll explode the data a little:
bigdat <- do.call(rbind, lapply(1:10000, function(i) transform(dat, id = id + i*1000)))
bigDT <- as.data.table(bigdat)
nrow(bigdat)
# [1] 80000
Unfortunately, the base R solution now takes a few minutes (though it still works), and fuzzyjoin errors with memory-allocation problems. That leaves just sqldf and data.table:
bench::mark(
sqldf = sqldf::sqldf("
with cte as (
select t1.id, t1.visit_date, count(t3.visit_date) as prior_oneyear
from bigdat t1
left join bigdat t3 on t1.id = t3.id
and t3.visit_date between (t1.visit_date-365) and (t1.visit_date-1)
group by t1.id, t1.visit_date
)
select t1.id, t1.visit_date, count(t2.visit_date) as prior_visits,
cte.prior_oneyear
from bigdat t1
left join bigdat t2 on t1.id = t2.id and t1.visit_date > t2.visit_date
left join cte on t1.id = cte.id and t1.visit_date = cte.visit_date
group by t1.id, t1.visit_date"),
data.table = bigDT[, oneyear := visit_date %m-% years(1)
][, prior_visits := bigDT[bigDT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N
][, prior_oneyear := bigDT[bigDT, .N, on = .(id, visit_date >= oneyear, visit_date < visit_date), by = .EACHI]$N
][, oneyear := NULL][],
check = FALSE)
And those results:
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sqldf 461ms 473ms 2.11 5.52MB 0 2 0 946ms <NULL> <Rprofmem[,3] ~ <bch:tm~ <tibble [~
2 data.table 307ms 319ms 3.14 30.58MB 0 2 0 637ms <NULL> <Rprofmem[,3] ~ <bch:tm~ <tibble [~
This suggests that the two are about the same in speed/efficiency, but sqldf may be more memory-frugal. As the data gets much larger, the relative performance may well shift; if you run any of these code blocks against your real (1M-row) data, please come back and comment on how they compare.
Data
dat <- structure(list(id = c(122L, 122L, 122L, 324L, 324L, 454L, 3332L, 3332L), visit_date = structure(c(14245, 14540, 14596, 16442, 16864, 14732, 15524, 15745), class = "Date")), class = "data.frame", row.names = c(NA, -8L))
CodePudding user response:
Does this work:
library(dplyr)
library(lubridate)
df %>%
  mutate(visit_date = as.Date(visit_date)) %>%
  group_by(id) %>%
  mutate(prior_visits = row_number() - 1,
         prior_one_year_visits = year(visit_date) -
           lag(year(visit_date), default = min(year(visit_date))))
# A tibble: 8 x 4
# Groups: id [4]
id visit_date prior_visits prior_one_year_visits
<int> <date> <dbl> <dbl>
1 122 2009-01-01 0 0
2 122 2009-10-23 1 0
3 122 2009-12-18 2 0
4 324 2015-01-07 0 0
5 324 2016-03-04 1 1
6 454 2010-05-03 0 0
7 3332 2012-07-03 0 0
8 3332 2013-02-09 1 1