I am working in R and I have some time-series data where the relevant time is a date range: each row has both a start date and an end date, and the record specifies outcomes over that period. I would like to represent this object in an appropriate way in R and merge it with other time series that have specific dates. In the merge, I would like each record in my first time series (the one using a date range) to be matched with a date falling within the range, with appropriate tie-breakers when two or more specific dates fall within the range. Here is a simple example of what I want, using mock data:
#Time series with a date-range
TIME.SERIES1
Start End Value
2018-01-01 2018-01-31 80.6
2018-02-01 2018-02-28 140.5
2018-03-01 2018-03-31 122.1
2018-04-01 2018-04-30 108.5
2018-05-01 2018-05-31 96.0
2018-06-01 2018-06-30 103.3
#Time series using specific dates
TIME.SERIES2
Date Staff
2018-01-15 551
2018-02-14 606
2018-05-18 640
2018-06-02 635
2018-07-15 633
#Merge these using left-join of TIME.SERIES1 and TIME.SERIES2
[some merge command]
Start End Value Staff
2018-01-01 2018-01-31 80.6 551
2018-02-01 2018-02-28 140.5 606
2018-03-01 2018-03-31 122.1 NA
2018-04-01 2018-04-30 108.5 NA
2018-05-01 2018-05-31 96.0 640
2018-06-01 2018-06-30 103.3 635
I know that you can do this type of merge fairly easily in SQL, but I'm not sure how to do it in R. I have previously used the xts package for time-series analysis, but I don't know if it can accomplish what I want to do. This package appears to allow me to specify a single date as the index, but not a date range. Similarly, it appears to allow me to merge two time-series objects on a single date, but I'm not sure how to merge based on a date range (in the manner shown above).
Questions: Can the time-series facilities in R handle an object where the time index is a date range rather than a specific date (i.e., can you have an index consisting of a start and end time)? How do we merge data with a date range against data with individual dates?
CodePudding user response:
Since the periods are always full single months, and assuming the inputs are the data frames shown in the Note at the end (with all date columns of Date class), add a yearmon class column to both and join on that.
library(zoo)    # as.yearmon()
library(dplyr)  # left_join()
# Tag each row with its calendar month, then join on that key
ts1$ym <- as.yearmon(ts1$Start)
ts2$ym <- as.yearmon(ts2$Date)
left_join(ts1, ts2, by = "ym")
giving:
Start End Value ym Date Staff
1 2018-01-01 2018-01-31 80.6 Jan 2018 2018-01-15 551
2 2018-02-01 2018-02-28 140.5 Feb 2018 2018-02-14 606
3 2018-03-01 2018-03-31 122.1 Mar 2018 <NA> NA
4 2018-04-01 2018-04-30 108.5 Apr 2018 <NA> NA
5 2018-05-01 2018-05-31 96.0 May 2018 2018-05-18 640
6 2018-06-01 2018-06-30 103.3 Jun 2018 2018-06-02 635
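The join above returns one row per matching date, so a month containing two or more dates in ts2 would produce duplicated ts1 rows. The following is a minimal sketch of one possible tie-breaker, keeping only the earliest date in each month before joining. The extra 2018-01-20 row and the name ts2_first are hypothetical, added only to illustrate a tie; the data are otherwise a subset of the question's mock data.

```r
library(zoo)
library(dplyr)

# Subset of the question's mock data, plus one hypothetical extra date
# (2018-01-20) added here only to create a tie within January.
ts1 <- data.frame(
  Start = as.Date(c("2018-01-01", "2018-02-01", "2018-03-01")),
  End   = as.Date(c("2018-01-31", "2018-02-28", "2018-03-31")),
  Value = c(80.6, 140.5, 122.1)
)
ts2 <- data.frame(
  Date  = as.Date(c("2018-01-15", "2018-01-20", "2018-02-14")),
  Staff = c(551L, 560L, 606L)
)

ts1$ym <- as.yearmon(ts1$Start)
ts2$ym <- as.yearmon(ts2$Date)

# Tie-breaker (an assumption, not part of the original answer):
# sort by date and keep the first row seen for each month.
ts2_sorted <- ts2[order(ts2$Date), ]
ts2_first  <- ts2_sorted[!duplicated(ts2_sorted$ym), ]

res <- left_join(ts1, ts2_first, by = "ym")
res
```

Sorting in decreasing order instead would keep the latest date in each month.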
Note
# Reproducible versions of TIME.SERIES1 (ts1) and TIME.SERIES2 (ts2) from the question
Lines1 <- "
Start End Value
2018-01-01 2018-01-31 80.6
2018-02-01 2018-02-28 140.5
2018-03-01 2018-03-31 122.1
2018-04-01 2018-04-30 108.5
2018-05-01 2018-05-31 96.0
2018-06-01 2018-06-30 103.3"
ts1 <- transform(read.table(text = Lines1, header = TRUE),
                 Start = as.Date(Start), End = as.Date(End))
Lines2 <- "
Date Staff
2018-01-15 551
2018-02-14 606
2018-05-18 640
2018-06-02 635
2018-07-15 633"
ts2 <- transform(read.table(text = Lines2, header = TRUE),
                 Date = as.Date(Date))
CodePudding user response:
One idea is to create a Cartesian product and then filter it to keep only the valid cases. The benefit of this approach is that you can use the usual operations on the data (grouping, mutating, filtering), which makes it flexible to choose how ties are resolved. The trade-off is that the intermediate table has one row for every pairing of a range with a date, so it is best suited to small or moderately sized inputs.
Thank you to G. Grothendieck for the code on creating the datasets from the question.
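library(tidyverse)  # cross_join() needs dplyr >= 1.1.0
library(lubridate)
# See G. Grothendieck's answer for ts1 and ts2
value <- tibble(ts1)
staff <- tibble(ts2)
# Pair every range with every date, then flag the dates that fall inside the range
# (on older dplyr, full_join(value, staff, by = character()) builds the same product)
cross_join(value, staff) %>%
  mutate(valid = Date %within% interval(Start, End)) %>%
  group_by(Start, End, Value) %>%
  arrange(Date, .by_group = TRUE) %>%
  # Tie-breaker: earliest valid date; NA when no date falls in the range
  summarise(Staff = Staff[valid][1], .groups = "drop")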
Output
Start End Value Staff
<date> <date> <dbl> <int>
1 2018-01-01 2018-01-31 80.6 551
2 2018-02-01 2018-02-28 140. 606
3 2018-03-01 2018-03-31 122. NA
4 2018-04-01 2018-04-30 108. NA
5 2018-05-01 2018-05-31 96 640
6 2018-06-01 2018-06-30 103. 635
Note that in this solution I used the earliest Staff value in the range. You can easily change this to suit your requirements.
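As a rough illustration of changing the tie-breaker, the sketch below computes three alternatives side by side (earliest, latest, and mean of the valid Staff values). It uses base R's merge(..., by = NULL) for the Cartesian product so that it does not depend on a particular dplyr version. The extra 2018-01-20 row and the column names Staff_first, Staff_last and Staff_mean are hypothetical, introduced only for this example.

```r
library(dplyr)

# Subset of the question's mock data, plus one hypothetical extra date
# (2018-01-20) so that January contains a tie.
value <- data.frame(
  Start = as.Date(c("2018-01-01", "2018-02-01", "2018-03-01")),
  End   = as.Date(c("2018-01-31", "2018-02-28", "2018-03-31")),
  Value = c(80.6, 140.5, 122.1)
)
staff <- data.frame(
  Date  = as.Date(c("2018-01-15", "2018-01-20", "2018-02-14")),
  Staff = c(551L, 560L, 606L)
)

# Cartesian product: every range paired with every date
pairs <- merge(value, staff, by = NULL)

res <- pairs %>%
  mutate(valid = Date >= Start & Date <= End) %>%
  group_by(Start, End, Value) %>%
  arrange(Date, .by_group = TRUE) %>%
  summarise(
    Staff_first = Staff[valid][1],        # earliest date in the range
    Staff_last  = rev(Staff[valid])[1],   # latest date in the range
    Staff_mean  = if (any(valid)) mean(Staff[valid]) else NA_real_,
    .groups = "drop"
  )
res
```

Ranges with no matching date get NA in all three columns, because indexing an empty vector with [1] returns NA.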
sqldf solution
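library(sqldf)
library(tidyverse)
# value and staff are the tibbles created above
# Range join in SQL: keep every range, attach any date that falls inside it
sqldf(
  "SELECT *
   FROM value
   LEFT JOIN staff
   ON staff.Date <= value.End AND staff.Date >= value.Start"
) %>%
  group_by(Start, End, Value) %>%
  # Tie-breaker: earliest date; unmatched ranges have Date = NA and so yield NA
  summarise(Staff = Staff[Date == min(Date)], .groups = "drop")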
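For comparison, the same range condition can be expressed without SQL. The sketch below is not part of the original answer: it swaps in dplyr's join_by() inequality conditions, which assumes dplyr 1.1.0 or later, and rebuilds the question's mock data so that it runs on its own. The earliest-date tie-breaker is the same assumption as above.

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

# Mock data from the question
ts1 <- data.frame(
  Start = as.Date(c("2018-01-01", "2018-02-01", "2018-03-01",
                    "2018-04-01", "2018-05-01", "2018-06-01")),
  End   = as.Date(c("2018-01-31", "2018-02-28", "2018-03-31",
                    "2018-04-30", "2018-05-31", "2018-06-30")),
  Value = c(80.6, 140.5, 122.1, 108.5, 96.0, 103.3)
)
ts2 <- data.frame(
  Date  = as.Date(c("2018-01-15", "2018-02-14", "2018-05-18",
                    "2018-06-02", "2018-07-15")),
  Staff = c(551L, 606L, 640L, 635L, 633L)
)

# Non-equi left join: keep each ts1 row, attach ts2 rows with Start <= Date <= End
joined <- left_join(ts1, ts2, by = join_by(Start <= Date, End >= Date))

# Tie-breaker: sort by date within each range (NA sorts last) and keep the first row
res <- joined %>%
  arrange(Start, Date) %>%
  distinct(Start, End, Value, .keep_all = TRUE)
res
```

Unlike the Cartesian-product approach, this does not build every range-date pairing first, which may matter for larger inputs.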