I have a large (reduced and simplified for example) data frame (df1) and a smaller one (df2). My goal is to return the single row of df1 for each variable a:d that most closely matches the value for each a:d in df2, so that I have a value for julian (index) and the nearest match to a-d from df1.
The resulting value for julian for the below example code would be something like 9,20,10,24.
I have tried various approaches with dplyr such as inner_join but have not found a method to accomplish the approximate/nearest matching. I did see an approach to a similar problem in data.table with roll = "nearest", but it seemed to need set a key on a single column and I did not quite see how to adapt to my problem.
df1=structure(list(julian = 1:25, a = c(1.23651000172524, 3.53768533577433,
4.67886484270661, 5.58620230847292, 6.50334496193416, 8.39024634778858,
8.07046072833857, 10.3224760103505, 12.0294821656441, 13.0051126737986,
13.6182967988946, 16.0093802756876, 16.3629833376647, 16.4032251800014,
20.7793611866438, 22.8896881514255, 24.9047753769525, 27.693270690425,
28.1354667109412, 28.7461982083997, 29.4226852088927, 30.0674285282579,
30.5385160713231, 30.2693063298828, 31.2137034020167), b = c(0.0818659903121299,
3.16872244667597, 5.18716715995415, 8.01357800633546, 9.03929283020061,
9.75879226945875, 10.6840909493883, 11.993517678357, 13.5471390532808,
14.5735221374674, 14.5629644871486, 14.3256165384972, 16.1055271965235,
16.0846958235405, 16.2822070785324, 17.5707074083182, 18.0211458028877,
18.2487504080849, 18.2748191242643, 20.2487076102867, 22.4499741093445,
22.7261732975307, 23.3057791855779, 24.205539951076, 26.8265110564906
), c = c(1.53338066501425, 1.79175662364937, 3.87114113188336,
7.25916181355392, 7.27167564643793, 9.45150932098826, 9.79130847937593,
8.74357825723914, 7.65948191832803, 8.29615547570692, 10.0896110687571,
10.622117084693, 12.8381085732359, 12.2479308117301, 14.1459672328209,
15.1971928224078, 16.2405218836198, 17.6742310372874, 18.2592891390671,
18.5239348003001, 17.8195001436016, 18.2938114977865, 19.3566862831869,
20.2531494246964, 21.2021100407786), d = c(1.06048990933208,
1.19478517029592, 2.01700007570706, 2.89122927429884, 2.84927053502588,
1.99042435941095, 2.84730950732035, 5.24044843997076, 6.78411332171822,
6.94262185334297, 8.46835997983591, 9.00847980399115, 10.0903648270813,
12.7882885080826, 13.5944586952418, 13.8267656264867, 14.7989016465317,
15.7181393355348, 17.3822790373397, 17.8017108178726, 19.8442415003192,
21.1743994279903, 20.9854326674201, 22.0177140097714, 24.0917743668801
)), class = "data.frame", row.names = c(NA, -25L))
df2=structure(list(a = 12.35, b = 19.63, c = 8.3, d = 22.5), class = "data.frame", row.names = c(NA,
-1L))
CodePudding user response:
The following uses Map
to compare each column a:d in df1
to the corresponding reference value in df2
, finds the index of the minimum, and finally, uses that as an index into the "julian" column of df1
library(tidyverse)
indices <- Map(`-`, df1[-1], df2) %>%
lapply(function(x) which(abs(x) == min(abs(x)))) %>%
sapply(function(x) df1[['julian']][x])
a b c d
9 20 10 24
And if you then wanted to filter df1
:
df1 %>%
filter(julian %in% indices)
julian a b c d
1 9 12.02948 13.54714 7.659482 6.784113
2 10 13.00511 14.57352 8.296155 6.942622
3 20 28.74620 20.24871 18.523935 17.801711
4 24 30.26931 24.20554 20.253149 22.017714
CodePudding user response:
Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with roll="nearest"
that you mentioned.
In this case I've melt
ed both sets of data to long datasets so that the matching can be done in a single join.
library(data.table)
setDT(df1)
setDT(df2)
df1[
julian %in%
melt(df1, id.vars="julian")[
melt(df2, measure.vars=names(df2)), on=c("variable","value"), roll="nearest"
]$julian,
]
# julian a b c d
#1: 9 12.02948 13.54714 7.659482 6.784113
#2: 10 13.00511 14.57352 8.296155 6.942622
#3: 20 28.74620 20.24871 18.523935 17.801711
#4: 24 30.26931 24.20554 20.253149 22.017714
If you want separate tables for each join instead you could do something like:
lapply(names(df2), \(var) df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)] )