Home > database >  Joining two data frames in R, by nearest date and one other variable
Joining two data frames in R, by nearest date and one other variable

Time:05-04

I have two data frames, a price dataset and a survey dataset, that I need to join on the closest date. The dates don't exactly match, and I need the price dataset to match on the closest date in the survey dataset. A previous question that I asked here on this was very helpful in giving me a an answer as to how to match on the closest date. However, I've since realized that I also need to join the data sets on region, as well as the closest date. I know how to join on closest date, I know how to join on region, but I unfortunately do not know how to join on the closest date and the exact same region.

Here's some example code and what I've been trying

#generating sample dataset
set.seed(1234)
price <- data.frame(region = sample(LETTERS[1:3],15,replace = TRUE), price = rnorm(15),price_date=sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 15)) 
survey <- data.frame(region = sample(LETTERS[1:3],15,replace = TRUE), survey_date=sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 15)) 

#this works to join on the closest date, however, the regions don't match, which is a problem


library(data.table)
setDT(price)
setDT(survey)

join <- price[survey,on=.(price_date=survey_date),roll="nearest"]

Basically what I want to do is match the date/region observation from price to the closest date, same region observation in survey, and I'm not really sure as to how to do this. Would appreciate any help.

CodePudding user response:

You could add region to on condition:

price[survey,.(region,price,x.price_date,survey_date),on=.(region,price_date=survey_date),roll="nearest"][]

    region       price x.price_date survey_date
    <char>       <num>       <Date>      <Date>
 1:      A  0.88010416   1999-10-26  1999-09-05
 2:      B  0.31026217   1999-12-25  1999-12-31
 3:      A -1.68732684   1999-04-27  1999-06-02
 4:      C  0.00500695   1999-08-09  1999-05-11
 5:      C  0.00500695   1999-08-09  1999-03-24
 6:      B -0.03763026   1999-08-12  1999-09-02
 7:      C -0.64701901   1999-12-24  2000-01-01
 8:      B -0.03763026   1999-08-12  1999-08-06
 9:      C  0.00500695   1999-08-09  1999-08-03
10:      A  0.88010416   1999-10-26  1999-11-22
11:      C  1.37001035   1999-09-15  1999-10-03
12:      B  0.01831663   1999-07-01  1999-06-18
13:      A -0.62743621   1999-03-20  1999-03-12
14:      B  0.72397606   1999-02-18  1999-03-02
15:      C -0.64701901   1999-12-24  1999-12-18

Note use of x. to display LHS date.

  • Related