I'm looking for an R-package to group consecutive dates into periods. In addition, columns must be grouped by FID, PID and SETTING:
# Input data
input <- read.csv(text=
"FID,PID,SETTING,DATE
00001, 100001, ST, 2021-01-01
00001, 100001, ST, 2021-01-02
00001, 100001, ST, 2021-01-03
00001, 100002, AB, 2021-01-04
00001, 100001, ST, 2021-01-11
00001, 100001, ST, 2021-01-12
00002, 200001, AB, 2021-01-02
00002, 200001, AB, 2021-01-03
00002, 200001, AB, 2021-01-04
00002, 200002, TK, 2021-01-05"
)
# Expected output
output <- read.csv(text="
FID,PID,SETTING,START,END
00001, 100001, ST, 2021-01-01, 2021-01-03
00001, 100002, AB, 2021-01-04, 2021-01-04
00001, 100001, ST, 2021-01-11, 2021-01-12
00002, 200001, AB, 2021-01-02, 2021-01-04
00002, 200002, TK, 2021-01-05, 2021-01-05"
)
I 've to group around 700'000 lines. Therefore, the solution should be as performant as possible.
CodePudding user response:
base R
input <- input[order(input$DATE),]
input$grp <- ave(as.integer(input$DATE), input[-4], FUN = function(z) cumsum(c(TRUE, diff(z) > 1)))
input
# FID PID SETTING DATE grp
# 1 1 100001 ST 2021-01-01 1
# 2 1 100001 ST 2021-01-02 1
# 7 2 200001 AB 2021-01-02 1
# 3 1 100001 ST 2021-01-03 1
# 8 2 200001 AB 2021-01-03 1
# 4 1 100002 AB 2021-01-04 1
# 9 2 200001 AB 2021-01-04 1
# 10 2 200002 TK 2021-01-05 1
# 5 1 100001 ST 2021-01-11 1
# 6 1 100001 ST 2021-01-12 1
out <- aggregate(DATE ~ FID PID SETTING grp, data = input,
FUN = function(z) setNames(range(z), c("START","END")))
out <- do.call(data.frame, out)
out[,5:6] <- lapply(out[,5:6], as.Date, origin = "1970-01-01")
out
# FID PID SETTING grp DATE.START DATE.END
# 1 1 100002 AB 1 2021-01-04 2021-01-04
# 2 2 200001 AB 1 2021-01-02 2021-01-04
# 3 1 100001 ST 1 2021-01-01 2021-01-03
# 4 2 200002 TK 1 2021-01-05 2021-01-05
# 5 1 100001 ST 2 2021-01-11 2021-01-12
Walk-through:
- the ease of
cumsum
anddiff
is accomplished assuming that the dates are always ordered; it is not important (here) that the other grouping variables may be misordered; ave(..)
assigns groups of non-consecutive (diff over 1) dates, which we use in the next step;aggregate
calculates the range within each group, using your three variables plus our newgrp
grouping variable; eachz
in the anonymous function is a contiguous vector of dates, sorange
gives us that start/end dates;- unfortunately, aggregate is assigning a matrix as the fifth column instead of two separate columns, so
do.call(data.frame, out)
fixes that; - unfortunately, most base R aggregating functions tend to strip the
Date
(andPOSIXt
) class from the vectors, so we need to useas.Date
to heal that.
dplyr
library(dplyr)
input %>%
arrange(DATE) %>%
group_by(FID, PID, SETTING) %>%
mutate(grp = cumsum(c(TRUE, diff(DATE) > 1))) %>%
group_by(FID, PID, SETTING, grp) %>%
summarize(START = min(DATE), END = max(DATE)) %>%
ungroup()
# # A tibble: 5 x 6
# FID PID SETTING grp START END
# <int> <int> <chr> <int> <date> <date>
# 1 1 100001 " ST" 1 2021-01-01 2021-01-03
# 2 1 100001 " ST" 2 2021-01-11 2021-01-12
# 3 1 100002 " AB" 1 2021-01-04 2021-01-04
# 4 2 200001 " AB" 1 2021-01-02 2021-01-04
# 5 2 200002 " TK" 1 2021-01-05 2021-01-05
data.table
library(data.table)
inputDT <- as.data.table(input)
setorder(inputDT, DATE)
inputDT[, grp := cumsum(c(TRUE, diff(DATE) > 1)), by = .(FID, PID, SETTING)
][, .(START = min(DATE), END = max(DATE)), by = .(FID, PID, SETTING, grp)
][]
# FID PID SETTING grp START END
# <int> <int> <char> <int> <Date> <Date>
# 1: 1 100001 ST 1 2021-01-01 2021-01-03
# 2: 2 200001 AB 1 2021-01-02 2021-01-04
# 3: 1 100002 AB 1 2021-01-04 2021-01-04
# 4: 2 200002 TK 1 2021-01-05 2021-01-05
# 5: 1 100001 ST 2 2021-01-11 2021-01-12