Home > Software engineering >  Joining two data frames on the closest date in R
Joining two data frames on the closest date in R

Time:05-03

I have two datasets that I would like to join based on date. One is a survey dataset, and the other is a list of prices at various dates. The dates don't match exactly, so I would like to join on the nearest date in the survey dataset (the price data is weekly).

Here's a brief snippet of what the survey dataset looks like (there are many other variables, but here's the two most relevant):

ID actual.date
20120377 2012-09-26
2020455822 2020-11-23
20126758 2012-10-26
20124241 2012-10-25
2020426572 2020-11-28

And here's the price dataset (also much larger, but you get the idea):

date price.var1 price.var2
2017-10-30 2.74733926399869 2.73994826674735
2015-03-16 2.77028200438506 2.74079930272231
2010-10-18 3.4265947805337 3.41591263539176
2012-10-29 4.10095806545397 4.14717556976502
2012-01-09 3.87888859352037 3.93074237884497

What I would like to do is join the price dataset to the survey dataset, joining on the nearest date.

I've tried a number of different things, none of which have worked to my satisfaction.

#reading in sample data
library(data.table)
library(dplyr)
survey <- fread(" ID actual.date
1:   20120377  2012-09-26
2: 2020455822  2020-11-23
3:   20126758  2012-10-26
4:   20124241  2012-10-25
5: 2020426572  2020-11-28
> ") %>% select(-V1)

price <- fread("date price.var1 price.var2
1: 2017-10-30   2.747339   2.739948
2: 2015-03-16   2.770282   2.740799
3: 2010-10-18   3.426595   3.415913
4: 2012-10-29   4.100958   4.147176
5: 2012-01-09   3.878889   3.930742") %>% select(-V1)


#using data.table

setDT(survey)[,DT_DATE := actual.date]
setDT(price)[,DT_DATE := date]

survey_price <- survey[price,on=.(DT_DATE),roll="nearest"]
#This works, and they join, but it drops a ton of observations, which won't work

#using dplyr
library(dplyr)
survey_price <- left_join(survey,price,by=c("actual.date"="date"))
#this joins them without dropping observations, but all of the price variables become NAs

Would really appreciate any help anyone could give me here.

CodePudding user response:

You were almost there.
In the DT[i,on] syntax, i should be survey to join on all its rows

setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price

         date price.var1 price.var2         ID
       <IDat>      <num>      <num>      <int>
1: 2012-09-26   4.100958   4.147176   20120377
2: 2020-11-23   2.747339   2.739948 2020455822
3: 2012-10-26   4.100958   4.147176   20126758
4: 2012-10-25   4.100958   4.147176   20124241
5: 2020-11-28   2.747339   2.739948 2020426572

CodePudding user response:

Convert the dates to numeric and find the closest date from the survey for price with Closest() from DescTools, and take that value.

Example datasets

survey <- tibble(
  ID = sample(20000:40000, 9, replace = TRUE),
  actual.date = seq(today() %m % days(5), today() %m % days(5) %m % months(2),
                    "week")
)

price <- tibble(
  date = seq(today(), today() %m % months(2), by = "week"),
  price_1 = sample(2:6, 9, replace = TRUE),
  price_2 = sample(2:6, 9, replace = TRUE)
)

survey

# A tibble: 9 x 2
     ID actual.date
  <int> <date>     
1 34592 2022-05-07 
2 37846 2022-05-14 
3 22715 2022-05-21 
4 22510 2022-05-28 
5 30143 2022-06-04 
6 34348 2022-06-11 
7 21538 2022-06-18 
8 39802 2022-06-25 
9 36493 2022-07-02 

price

# A tibble: 9 x 3
  date       price_1 price_2
  <date>       <int>   <int>
1 2022-05-02       6       6
2 2022-05-09       3       2
3 2022-05-16       6       4
4 2022-05-23       6       2
5 2022-05-30       2       6
6 2022-06-06       2       4
7 2022-06-13       2       2
8 2022-06-20       3       5
9 2022-06-27       5       6
library(tidyverse)
library(lubridate) 
library(DescTools) 

price <- price %>% 
  mutate(date = Closest(survey$actual.date %>%
                          as.numeric, date %>%
                          as.numeric) %>%
           as_date())

# A tibble: 9 x 3
  date       price_1 price_2
  <date>       <int>   <int>
1 2022-05-07       6       6
2 2022-05-14       3       2
3 2022-05-21       6       4
4 2022-05-28       6       2
5 2022-06-04       2       6
6 2022-06-11       2       4
7 2022-06-18       2       2
8 2022-06-25       3       5
9 2022-07-02       5       6

merge(survey, price, by.x = "actual.date", by.y = "date")

 actual.date    ID price_1 price_2
1  2022-05-07 34592       6       6
2  2022-05-14 37846       3       2
3  2022-05-21 22715       6       4
4  2022-05-28 22510       6       2
5  2022-06-04 30143       2       6
6  2022-06-11 34348       2       4
7  2022-06-18 21538       2       2
8  2022-06-25 39802       3       5
9  2022-07-02 36493       5       6
  • Related