Home > database >  How to find the best matching dates on two dataframes in r?
How to find the best matching dates on two dataframes in r?

Time:05-19

I have two dataframes insitu and model:

dput(head(insitu,20))
structure(list(ID = c("AUR", "AUR", "AUR", "AUR", "AUR", "AUR", 
"LAM", "LAM", "LAM", "LAM", "LAM", "LAM"), D_SOS = structure(c(16929, 
17149, 17422, 17850, 18389, 18202, 17044, 16744, 17300, 17522, 
18027, 18198), class = "Date"), D_EOS = structure(c(17067, 17353, 
17712, 18082, 18516, 18360, 17123, 17002, 17414, 17722, 18148, 
18446), class = "Date")), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

dput(head(model,20))
structure(list(ID = c("AUR", "AUR", "AUR", "AUR", "AUR", "AUR", 
"AUR", "AUR", "LAM", "LAM", "LAM", "LAM", "LAM", "LAM", "LAM"
), EVI_SOS = structure(c(16934, 17137, 17378, 17605, 17862, 18003, 
18192, 18395, 16744, 17134, 17278, 17518, 17725, 18004, 18200
), class = "Date"), EVI_EOS = structure(c(17074, 17361, 17591, 
17798, 17994, 18096, 18376, 18594, 17106, 17252, 17431, 17705, 
17862, 18173, 18549), class = "Date")), row.names = c(NA, -15L
), class = c("tbl_df", "tbl", "data.frame"))

What I want to do is the following: I want to find the best matching dates between two dataframes on the selected columns and corresponding rows. In other words, in the dataframe insitu in the column D_SOS which intercept with the rows of AUR of the column ID which dates best match with the column EVI_SOS which intercept the rows AUR of the column ID from the dataframe model. The same will have to be made on the LAM rows.

The desired output example would be:

dput(head(output,20))
structure(list(ID = c("AUR", "AUR", "AUR", "AUR", "AUR", "AUR", 
"LAM", "LAM", "LAM", "LAM", "LAM", "LAM"), D_SOS = structure(c(16929, 
17149, 17422, 17850, 18389, 18202, 17044, 16744, 17300, 17522, 
18027, 18198), class = "Date"), EVI_SOS = structure(c(16934, 
17137, 17378, 17862, 18395, 18192, 17134, 16744, 17278, 17518, 
18004, 18200), class = "Date"), D_EOS = structure(c(17067, 17353, 
17712, 18082, 18516, 18360, 17123, 17002, 17414, 17722, 18148, 
18446), class = "Date"), EVI_EOS = structure(c(17074, 17361, 
17798, 18096, 18594, 18376, 17252, 17106, 17431, 17705, 18173, 
18549), class = "Date")), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

It will look like this:

   ID    D_SOS        EVI_SOS     D_EOS        EVI_EOS  
1  AUR   2016-05-08   2016-05-13  2016-09-23   2016-09-30 
2  AUR   2016-12-14   2016-12-02  2017-07-06   2017-07-14       
3  AUR   2017-09-13   2017-07-31  2018-06-30   2018-09-24
4  AUR   2018-11-15   2018-11-27  2019-07-05   2019-07-19
5  AUR   2020-05-17   2020-05-13  2020-09-11   2020-11-28
6  AUR   2019-11-02   2019-10-23  2020-04-08   2020-04-24
7  LAM   2016-08-31   2016-11-29  2016-11-18   2017-03-27
8  LAM   2015-11-05   2015-11-05  2016-07-20   2016-11-01
9  LAM   2017-05-14   2017-04-22  2017-09-05   2017-09-22
10 LAM   2017-12-22   2017-12-18  2018-07-10   2018-06-23
11 LAM   2019-05-11   2019-04-18  2019-09-09   2019-10-04
12 LAM   2019-10-29   2019-10-31  2020-07-03   2020-10-14

Basically, over the 8 AUR dates of model dataframe 6 will match since insitu only has 6 dates in AUR. For LAM model dataframe has 7 dates but insitu has 6 dates in LAM which will be the number to match. The output would have first the column of insitu, for example D_SOS and then the one of model, for example EVI_SOS which match with the correspondent dates.

Any help will be much appreciatted.

CodePudding user response:

library(data.table)

setDT(insitu)
setDT(model)

insitu[, key := D_SOS]
model[, key := EVI_SOS]

setkey(insitu, ID, key)
setkey(model, ID, key)

model[insitu, roll = "nearest"][, .(ID, D_SOS, EVI_SOS, D_EOS, EVI_EOS)]

#      ID      D_SOS    EVI_SOS      D_EOS    EVI_EOS
#  1: AUR 2016-05-08 2016-05-13 2016-09-23 2016-09-30
#  2: AUR 2016-12-14 2016-12-02 2017-07-06 2017-07-14
#  3: AUR 2017-09-13 2017-07-31 2018-06-30 2018-03-01
#  4: AUR 2018-11-15 2018-11-27 2019-07-05 2019-04-08
#  5: AUR 2019-11-02 2019-10-23 2020-04-08 2020-04-24
#  6: AUR 2020-05-07 2020-05-13 2020-09-11 2020-11-28
#  7: LAM 2015-11-05 2015-11-05 2016-07-20 2016-11-01
#  8: LAM 2016-08-31 2016-11-29 2016-11-18 2017-03-27
#  9: LAM 2017-05-14 2017-04-22 2017-09-05 2017-09-22
# 10: LAM 2017-12-22 2017-12-18 2018-07-10 2018-06-23
# 11: LAM 2019-05-11 2019-04-18 2019-09-09 2019-10-04
# 12: LAM 2019-10-29 2019-10-31 2020-07-03 2020-10-14
  • Related