I have a dataset of hospital visits and patient IDs. For each patient, I want to add a number of new variables, for instance the number of prior visits and the number of prior visits within a year of the current visit. A simplified version of the data looks like this.
id visit_date
122 2009-01-01
122 2009-10-23
122 2009-12-18
324 2015-01-07
324 2016-03-04
454 2010-05-03
3332 2012-07-03
3332 2013-02-09
Right now I'm grouping the data by id, arranging by descending date, and using a loop that checks every previous row for each row within a group (a simplified sketch of this approach follows the table below). There are a few million rows, so this ends up taking days, especially when there are multiple conditions to check. Is there a more efficient way to do this? The desired output would be
id visit_date prior_visits prior_one_year_visits
122 2009-01-01 0 0
122 2009-10-23 1 1
122 2009-12-18 2 2
324 2015-01-07 0 0
324 2016-03-04 1 0
454 2010-05-03 0 0
3332 2012-07-03 0 0
3332 2013-02-09 1 1
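My current approach looks roughly like this simplified sketch (with 365 days standing in for "one year"):
# For every row, compare against every other row to find same-id earlier
# visits. This is quadratic in the number of rows, which is why it takes
# days on millions of rows.
dat$prior_visits <- NA_integer_
dat$prior_one_year_visits <- NA_integer_
for (r in seq_len(nrow(dat))) {
  same_id <- dat$id == dat$id[r]
  earlier <- dat$visit_date < dat$visit_date[r]
  dat$prior_visits[r] <- sum(same_id & earlier)
  dat$prior_one_year_visits[r] <-
    sum(same_id & earlier & dat$visit_date >= dat$visit_date[r] - 365)
}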
CodePudding user response:
Up front: the most performant solutions here appear to be, in order, sqldf, data.table, and base R, while fuzzyjoin works with small data but not with larger data. The rest of the answer is kept in case smaller data (or package constraints) make the more readable or more comfortable code the better choice.
This can be done in base R rather efficiently. Up front: I'm assuming class(dat$visit_date) is Date.
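If visit_date is still stored as character, it can be converted first; a one-line reminder (nothing more) is:
# only needed if visit_date is not yet a Date; everything below assumes it is
dat$visit_date <- as.Date(dat$visit_date)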
library(lubridate) # for %m-% and years()

# prior_visits: for each visit, count same-id visits with a strictly earlier date
dat$prior_visits <- ave(
  as.integer(dat$visit_date), dat$id,
  FUN = function(z) rowSums(as.matrix(sapply(z, `<`, z))))

# prior_oneyear: the same, but restricted to the year before the current visit
dat$prior_oneyear <- as.integer(ave(
  dat$visit_date, dat$id,
  FUN = function(z) as.Date(
    rowSums(as.matrix(sapply(z, `<=`, z)) &
            as.matrix(sapply(z, `>=`, z %m-% years(1)))) - 1,
    origin = "1970-01-01")))
dat
dat
# id visit_date prior_visits prior_oneyear
# 1 122 2009-01-01 0 0
# 2 122 2009-10-23 1 1
# 3 122 2009-12-18 2 2
# 4 324 2015-01-07 0 0
# 5 324 2016-03-04 1 0
# 6 454 2010-05-03 0 0
# 7 3332 2012-07-03 0 0
# 8 3332 2013-02-09 1 1
Notes:
- The as.matrix is there to force sapply to return a matrix even when the data is only length-1 (for which sapply typically returns just a vector): rowSums requires a matrix, so we ensure it gets one. If the length is greater than 1, the call to as.matrix is a no-op.
- ave requires that its return value have the same class as its first argument. Because of this, we wrap the first call with as.integer(dat$visit_date) so that the return value will be ints; this is fine for basic <-logic. Unfortunately, in order to do "proper" one-year subtraction, we can't convert to a number without losing the leap-year possibility; because of this, we keep it as a Date, do the in-function year subtraction with lubridate's %m-%, force the resulting count to a Date with origin=, and then convert back to an integer outside ave. It seems roundabout, but it gives the correct count even in the presence of leap years.
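A tiny demonstration of both points (the objects here are throwaway examples, not part of the solution):
# With a length-1 group, sapply() returns a plain vector, which rowSums()
# would reject; as.matrix() turns it back into a 1x1 matrix.
z <- as.Date("2009-01-01")
sapply(z, `<`, z)              # logical vector of length 1
as.matrix(sapply(z, `<`, z))   # 1x1 matrix, safe for rowSums()

# lubridate's %m-% does month-aware subtraction, so leap days are handled:
library(lubridate)
as.Date("2016-02-29") %m-% years(1)   # "2015-02-28"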
This can also be solved with a range-based join, which is supported by sqldf, data.table, and fuzzyjoin, but not natively/directly by dplyr (though workarounds similar to the base R ave implementation above are certainly possible; see the sketch just below). All of the code produces the same output as above, albeit sometimes in a different data.frame dialect.
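For completeness, one such dplyr workaround (plain per-group pairwise comparisons in the spirit of the ave() version, not a true range join) might look like this sketch:
library(dplyr)
library(lubridate)

# within each id, compare every visit against the group's other visits
dat %>%
  group_by(id) %>%
  mutate(
    prior_visits  = sapply(seq_along(visit_date), function(i)
      sum(visit_date < visit_date[i])),
    prior_oneyear = sapply(seq_along(visit_date), function(i)
      sum(visit_date < visit_date[i] &
          visit_date >= visit_date[i] %m-% years(1)))
  ) %>%
  ungroup()
It avoids the join packages entirely, but like the ave() version it does pairwise comparisons within each group, so it relies on individual groups being small.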
sqldf
sqldf::sqldf("
with cte as (
select t1.id, t1.visit_date, count(t3.visit_date) as prior_oneyear
from dat t1
left join dat t3 on t1.id = t3.id
and t3.visit_date between (t1.visit_date-365) and (t1.visit_date-1)
group by t1.id, t1.visit_date
)
select t1.id, t1.visit_date, count(t2.visit_date) as prior_visits,
cte.prior_oneyear
from dat t1
left join dat t2 on t1.id = t2.id and t1.visit_date > t2.visit_date
left join cte on t1.id = cte.id and t1.visit_date = cte.visit_date
group by t1.id, t1.visit_date")
data.table
library(data.table)
library(lubridate) # %m-%, years
DT <- as.data.table(dat)
DT[, oneyear := visit_date %m-% years(1)
][, prior_visits := DT[DT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N
][, prior_oneyear := DT[DT, .N, on = .(id, visit_date >= oneyear, visit_date < visit_date), by = .EACHI]$N
][, oneyear := NULL][]
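The building block in that chain is a non-equi self-join; it may be easier to read pulled out on its own:
# For each row of DT supplied as i, count the rows of DT (as x) that have the
# same id and a strictly earlier visit_date. With by = .EACHI the result has
# one row per row of DT, in DT's row order, which is why $N can be assigned
# straight into prior_visits above.
DT[DT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N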
fuzzyjoin
library(dplyr)
library(lubridate) # for %m-% below
dat %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "visit_date" = "visit_date"), match_fun = list(`==`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = sum(visit_date.y < visit_date.x), .groups = "drop") %>%
mutate(oneyear = visit_date %m-% years(1)) %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "oneyear" = "visit_date", "visit_date" = "visit_date"), match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = prior_visits.x[1], prior_oneyear = sum(visit_date.y < visit_date.x)) %>%
ungroup()
Benchmark
While benchmarks on such small data are not a very good measure, they give a hint of the relative efficiency here.
bench::mark(
base = {
dat$prior_visits <- ave(
as.integer(dat$visit_date), dat$id,
FUN = function(z) rowSums(as.matrix(sapply(z, `<`, z))));
dat$prior_oneyear <- as.integer(ave(
dat$visit_date, dat$id,
FUN = function(z) as.Date(rowSums(
as.matrix(sapply(z, `<=`, z)) & as.matrix(sapply(z, `>=`, z %m-% years(1)))) - 1,
origin = "1970-01-01") ));
},
sqldf = sqldf::sqldf("
with cte as (
select t1.id, t1.visit_date, count(t3.visit_date) as prior_oneyear
from dat t1
left join dat t3 on t1.id = t3.id
and t3.visit_date between (t1.visit_date-365) and (t1.visit_date-1)
group by t1.id, t1.visit_date
)
select t1.id, t1.visit_date, count(t2.visit_date) as prior_visits,
cte.prior_oneyear
from dat t1
left join dat t2 on t1.id = t2.id and t1.visit_date > t2.visit_date
left join cte on t1.id = cte.id and t1.visit_date = cte.visit_date
group by t1.id, t1.visit_date"),
data.table = DT[, oneyear := visit_date %m-% years(1)
][, prior_visits := DT[DT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N
][, prior_oneyear := DT[DT, .N, on = .(id, visit_date >= oneyear, visit_date < visit_date), by = .EACHI]$N
][, oneyear := NULL][],
fuzzyjoin = dat %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "visit_date" = "visit_date"), match_fun = list(`==`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = sum(visit_date.y < visit_date.x), .groups = "drop") %>%
mutate(oneyear = visit_date %m-% years(1)) %>%
fuzzyjoin::fuzzy_inner_join(dat, by = c("id" = "id", "oneyear" = "visit_date", "visit_date" = "visit_date"), match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(id = id.x, visit_date = visit_date.x) %>%
summarize(prior_visits = prior_visits.x[1], prior_oneyear = sum(visit_date.y < visit_date.x)) %>%
ungroup(),
check = FALSE, min_iterations = 100)
Results:
# A tibble: 4 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 base 4.19ms 4.86ms 196. 59KB 3.99 98 2 501.15ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
2 sqldf 9.96ms 11.79ms 81.3 132KB 2.51 97 3 1.19s <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
3 data.table 5.22ms 7.91ms 118. 371KB 2.40 98 2 833.84ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
4 fuzzyjoin 161.5ms 185.19ms 5.32 320KB 5.32 50 50 9.39s <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
For a more realistic test, I'll explode the data a little:
bigdat <- do.call(rbind, lapply(1:10000, function(i) transform(dat, id = id + i*1000)))
bigDT <- as.data.table(bigdat)
nrow(bigdat)
# [1] 80000
Unfortunately, the base R solution now takes a few minutes (though it still works), and fuzzyjoin errors with memory-allocation problems. That leaves just sqldf and data.table:
bench::mark(
sqldf = sqldf::sqldf("
with cte as (
select t1.id, t1.visit_date, count(t3.visit_date) as prior_oneyear
from bigdat t1
left join bigdat t3 on t1.id = t3.id
and t3.visit_date between (t1.visit_date-365) and (t1.visit_date-1)
group by t1.id, t1.visit_date
)
select t1.id, t1.visit_date, count(t2.visit_date) as prior_visits,
cte.prior_oneyear
from bigdat t1
left join bigdat t2 on t1.id = t2.id and t1.visit_date > t2.visit_date
left join cte on t1.id = cte.id and t1.visit_date = cte.visit_date
group by t1.id, t1.visit_date"),
data.table = bigDT[, oneyear := visit_date %m-% years(1)
][, prior_visits := bigDT[bigDT, .N, on = .(id, visit_date < visit_date), by = .EACHI]$N
][, prior_oneyear := bigDT[bigDT, .N, on = .(id, visit_date >= oneyear, visit_date < visit_date), by = .EACHI]$N
][, oneyear := NULL][],
check = FALSE)
And those results:
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 sqldf 461ms 473ms 2.11 5.52MB 0 2 0 946ms <NULL> <Rprofmem[,3] ~ <bch:tm~ <tibble [~
2 data.table 307ms 319ms 3.14 30.58MB 0 2 0 637ms <NULL> <Rprofmem[,3] ~ <bch:tm~ <tibble [~
This suggests that the two are about the same in speed/efficiency, but sqldf may be more memory-frugal. As the data gets much larger, the relative performance may well shift; if you run any of these code blocks against your real (1M-row) data, please come back and comment on how they compare.
Data
dat <- structure(list(id = c(122L, 122L, 122L, 324L, 324L, 454L, 3332L, 3332L), visit_date = structure(c(14245, 14540, 14596, 16442, 16864, 14732, 15524, 15745), class = "Date")), class = "data.frame", row.names = c(NA, -8L))
CodePudding user response:
Does this work:
library(dplyr)
library(lubridate)
df %>%
  mutate(visit_date = as.Date(visit_date)) %>%
  group_by(id) %>%
  mutate(prior_visits = row_number() - 1,
         prior_one_year_visits = year(visit_date) -
           lag(year(visit_date), default = min(year(visit_date))))
# A tibble: 8 x 4
# Groups: id [4]
id visit_date prior_visits prior_one_year_visits
<int> <date> <dbl> <dbl>
1 122 2009-01-01 0 0
2 122 2009-10-23 1 0
3 122 2009-12-18 2 0
4 324 2015-01-07 0 0
5 324 2016-03-04 1 1
6 454 2010-05-03 0 0
7 3332 2012-07-03 0 0
8 3332 2013-02-09 1 1