Home > OS >  merging data frames based on multiple nearest matches in R
merging data frames based on multiple nearest matches in R

Time:03-14

I have a large (reduced and simplified for example) data frame (df1) and a smaller one (df2). My goal is to return the single row of df1 for each variable a:d that most closely matches the value for each a:d in df2, so that I have a value for julian (index) and the nearest match to a-d from df1.

The resulting value for julian for the below example code would be something like 9,20,10,24.

I have tried various approaches with dplyr such as inner_join but have not found a method to accomplish the approximate/nearest matching. I did see an approach to a similar problem in data.table with roll = "nearest", but it seemed to need set a key on a single column and I did not quite see how to adapt to my problem.

df1=structure(list(julian = 1:25, a = c(1.23651000172524, 3.53768533577433, 
4.67886484270661, 5.58620230847292, 6.50334496193416, 8.39024634778858, 
8.07046072833857, 10.3224760103505, 12.0294821656441, 13.0051126737986, 
13.6182967988946, 16.0093802756876, 16.3629833376647, 16.4032251800014, 
20.7793611866438, 22.8896881514255, 24.9047753769525, 27.693270690425, 
28.1354667109412, 28.7461982083997, 29.4226852088927, 30.0674285282579, 
30.5385160713231, 30.2693063298828, 31.2137034020167), b = c(0.0818659903121299, 
3.16872244667597, 5.18716715995415, 8.01357800633546, 9.03929283020061, 
9.75879226945875, 10.6840909493883, 11.993517678357, 13.5471390532808, 
14.5735221374674, 14.5629644871486, 14.3256165384972, 16.1055271965235, 
16.0846958235405, 16.2822070785324, 17.5707074083182, 18.0211458028877, 
18.2487504080849, 18.2748191242643, 20.2487076102867, 22.4499741093445, 
22.7261732975307, 23.3057791855779, 24.205539951076, 26.8265110564906
), c = c(1.53338066501425, 1.79175662364937, 3.87114113188336, 
7.25916181355392, 7.27167564643793, 9.45150932098826, 9.79130847937593, 
8.74357825723914, 7.65948191832803, 8.29615547570692, 10.0896110687571, 
10.622117084693, 12.8381085732359, 12.2479308117301, 14.1459672328209, 
15.1971928224078, 16.2405218836198, 17.6742310372874, 18.2592891390671, 
18.5239348003001, 17.8195001436016, 18.2938114977865, 19.3566862831869, 
20.2531494246964, 21.2021100407786), d = c(1.06048990933208, 
1.19478517029592, 2.01700007570706, 2.89122927429884, 2.84927053502588, 
1.99042435941095, 2.84730950732035, 5.24044843997076, 6.78411332171822, 
6.94262185334297, 8.46835997983591, 9.00847980399115, 10.0903648270813, 
12.7882885080826, 13.5944586952418, 13.8267656264867, 14.7989016465317, 
15.7181393355348, 17.3822790373397, 17.8017108178726, 19.8442415003192, 
21.1743994279903, 20.9854326674201, 22.0177140097714, 24.0917743668801
)), class = "data.frame", row.names = c(NA, -25L))

 df2=structure(list(a = 12.35, b = 19.63, c = 8.3, d = 22.5), class = "data.frame", row.names = c(NA, 
-1L))

CodePudding user response:

The following uses Map to compare each column a:d in df1 to the corresponding reference value in df2, finds the index of the minimum, and finally, uses that as an index into the "julian" column of df1

library(tidyverse)

indices <- Map(`-`, df1[-1], df2) %>% 
  lapply(function(x) which(abs(x) == min(abs(x)))) %>% 
  sapply(function(x) df1[['julian']][x])

 a  b  c  d 
 9 20 10 24 

And if you then wanted to filter df1:

df1 %>% 
  filter(julian %in% indices)

  julian        a        b         c         d
1      9 12.02948 13.54714  7.659482  6.784113
2     10 13.00511 14.57352  8.296155  6.942622
3     20 28.74620 20.24871 18.523935 17.801711
4     24 30.26931 24.20554 20.253149 22.017714

CodePudding user response:

Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with roll="nearest" that you mentioned.

In this case I've melted both sets of data to long datasets so that the matching can be done in a single join.

library(data.table)
setDT(df1)
setDT(df2)

df1[
  julian %in% 
    melt(df1, id.vars="julian")[
      melt(df2, measure.vars=names(df2)), on=c("variable","value"), roll="nearest"
    ]$julian,
]
#   julian        a        b         c         d
#1:      9 12.02948 13.54714  7.659482  6.784113
#2:     10 13.00511 14.57352  8.296155  6.942622
#3:     20 28.74620 20.24871 18.523935 17.801711
#4:     24 30.26931 24.20554 20.253149 22.017714

If you want separate tables for each join instead you could do something like:

lapply(names(df2), \(var)  df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)] )
  • Related