I have 2 tibbles. One is a list of stocks and an associated date. The other is a list of stocks and their sector, where the sector is define between 2 dates.
# stocks IDs and an associated date
stocks<-
tibble(
StockId = c(1, 1, 1, 2, 2),
Date = c(
as.Date("1998-01-05"),
as.Date("2001-07-01"),
as.Date("2015-01-01"),
as.Date("1999-05-01"),
as.Date("2003-02-02")
)
)
# stock IDs and their sector.
# For each stock, the sector has a start and end date as stocks change sector over time
sectors <- tibble(StockId = c(1, 1, 2, 2),
SectorId = c(5050, 2540, 5510, 1010),
StartDate = c(
as.Date("1980-01-05"),
as.Date("2000-07-01"),
as.Date("1999-01-01"),
as.Date("2010-01-02")
),
EndDate = c(
as.Date("2000-06-30"),
as.Date("9999-12-31"),
as.Date("2010-01-01"),
as.Date("9999-01-01")
)
)
I would like to get to a final tibble where I have stock, date and the sector for that particular date. In this case the result should be:
result <-
tibble(
StockId = c(1, 1, 1, 2, 2),
Date = c(
as.Date("1998-01-05"),
as.Date("2001-07-01"),
as.Date("2015-01-01"),
as.Date("1999-05-01"),
as.Date("2003-02-02")
),
SectorId = c(5050, 2540, 2540, 5510, 5510)
)
Note that a for-loop is not efficient here, as my tibble has much more data then this example.
CodePudding user response:
library(fuzzyjoin)
fuzzy_left_join(stocks, sectors,
by = c("StockId" = "StockId",
"Date" = "StartDate",
"Date" = "EndDate"),
match_fun = list(`==`, `>=`, `<=`)) %>%
dplyr::select(StockID = StockId.x, Date, SectorId)
Output:
# A tibble: 5 x 3
StockID Date SectorId
<dbl> <date> <dbl>
1 1 1998-01-05 5050
2 1 2001-07-01 2540
3 1 2015-01-01 2540
4 2 1999-05-01 5510
5 2 2003-02-02 5510
CodePudding user response:
You can also use dplyr
development version (as of 2022-09-05), which includes a join_by
function for overlap joins:
#devtools::install_github("tidyverse/dplyr")
library(dplyr)
inner_join(stocks, sectors, by = join_by(StockId, between(Date, StartDate, EndDate)))
# A tibble: 5 × 5
StockId Date SectorId StartDate EndDate
<dbl> <date> <dbl> <date> <date>
1 1 1998-01-05 5050 1980-01-05 2000-06-30
2 1 2001-07-01 2540 2000-07-01 9999-12-31
3 1 2015-01-01 2540 2000-07-01 9999-12-31
4 2 1999-05-01 5510 1999-01-01 2010-01-01
5 2 2003-02-02 5510 1999-01-01 2010-01-01
And the data.table
option:
library(data.table)
setDT(stocks)[setDT(sectors), .(StockId, Date=x.Date, SectorId),
on=.(StockId, Date>=StartDate, Date<=EndDate), nomatch=0L]
StockId Date SectorId
1: 1 2001-07-01 2540
2: 1 2015-01-01 2540
3: 1 1998-01-05 5050
4: 2 1999-05-01 5510
5: 2 2003-02-02 5510