How to find the closest value to x in y dataframe in R?-CodePudding

I have two dataframes, x and y, both of which represent items and their respective locations (which are represented as integers). Dataframe x responds to Genes and their locations; dataframe y responds to Enhancers and their locations. For every Gene in dataframe x, I want to find the Enhancer in y whose location is closest. Here are the first five rows of both dataframes:

Gene:           | Location:                            
----------------------------------                    
CORT            | 10450031             
LOC107985174    | 110639954
LOC105369199    | 120963648
CD1D            | 158178030
EPHA2           | 16124337


         Enhancer:                  |   Location:
-----------------------------------------------------
genic|NC_000001.11:180541-181713    |   180541
genic|NC_000001.11:819650-823755    |   819650   
genic|NC_000001.11:1290023-1294341  |   1290023
genic|NC_000001.11:2072541-2076498  |   2072541
genic|NC_000001.11:2132164-2134268  |   2132164

I have been using which.min() like so: Enhancers[which.min(abs(x-Enhancers$location)),] where x corresponds to the given gene's location, which does appear to work, but it requires manually inputing the location of each individual gene. I am wondering if there is a way to accomplish this for all the genes at once. Any help would be appreciated, thank you.

CodePudding user response：

You could expand a grid with both data sets, group by Gene and select the row with the minimum absolute difference in location for each Gene.

library(tidyr)
library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(2105)

x <- data.frame(Gene = letters[1:5], Location = 1:5)
y <- data.frame(Enhancer = letters[6:10], Location_enh = 5*runif(5))

x
#>   Gene Location
#> 1    a        1
#> 2    b        2
#> 3    c        3
#> 4    d        4
#> 5    e        5
y
#>   Enhancer Location_enh
#> 1        f    1.2275958
#> 2        g    2.2874741
#> 3        h    4.2954764
#> 4        i    4.2017862
#> 5        j    0.9555975

expand_grid(x, y) %>% 
  group_by(Gene) %>% 
  slice_min(abs(Location - Location_enh)) %>% 
  ungroup()
#> # A tibble: 5 x 4
#>   Gene  Location Enhancer Location_enh
#>   <chr>    <int> <chr>           <dbl>
#> 1 a            1 j               0.956
#> 2 b            2 g               2.29 
#> 3 c            3 g               2.29 
#> 4 d            4 i               4.20 
#> 5 e            5 h               4.30

^{Created on 2022-08-15 by the reprex package (v2.0.1)}

CodePudding user response：

Looks like genetic data. If you're already using Bioconductor packages, then Biobase::matchpt() will do the trick quickly and efficiently (and in multiple dimensions).

CodePudding user response：

This data doesn't show enough similarity to be interesting, but I think this method will work for what you need:

apply(abs(outer(df1$Location, df2$Location, `-`)), 1, which.min)
# [1] 5 5 5 5 5

This is saying that df1$Locations are all closest to the 5th value of df2$Location.

Data

df1 <- structure(list(Gene = c("CORT", "LOC107985174", "LOC105369199", "CD1D", "EPHA2"), Location = c(10450031, 110639954, 120963648, 158178030, 16124337)), row.names = c(NA, -5L), class = "data.frame")
df2 <- structure(list(Enhancer = c("NC_000001.11:180541-181713", "NC_000001.11:819650-823755", "NC_000001.11:1290023-1294341", "NC_000001.11:2072541-2076498", "NC_000001.11:2132164-2134268"), Location = c(180541, 819650, 1290023, 2072541, 2132164)), row.names = c(NA, -5L), class = "data.frame")

CodePudding user response：

You can use distance_join set of functions from fuzzyjoin R package to join 2 tables by distance. You can provide max_dist argument to show NA if there is no data within the specified range.

For this demo I've used the test data from @seb09's answer.

library(tidyverse)
library(fuzzyjoin)

set.seed(2105)

x <- data.frame(Gene = letters[1:5], Location = 1:5)
y <- data.frame(Enhancer = letters[6:10], Location_enh = 5*runif(5))

distance_left_join(x, y, by = c("Location" = "Location_enh"), max_dist = 1)
#>   Gene Location Enhancer Location_enh
#> 1    a        1        f    1.2275958
#> 2    a        1        j    0.9555975
#> 3    b        2        f    1.2275958
#> 4    b        2        g    2.2874741
#> 5    c        3        g    2.2874741
#> 6    d        4        h    4.2954764
#> 7    d        4        i    4.2017862
#> 8    e        5        h    4.2954764
#> 9    e        5        i    4.2017862

^{Created on 2022-08-15 by the reprex package (v2.0.0)}

If you need to select a single enhancer per gene with closest distance:

distance_left_join(x, y, by = c("Location" = "Location_enh"), max_dist = 1) %>% 
  group_by(Gene) %>% 
  slice_min(abs(Location - Location_enh))

CodePudding user response：

This just returns a vector of the matches:

l1=1e3;l2=2e3
x=data.frame(gene=paste0(letters,1:l1),Location=round(runif(l1)*1e8))
y=data.frame(enhancer=paste0(letters,1:l2),Location_enh=round(runif(l2)*1e8))

y[,1][sapply(x[,2],\(v)which.min(abs(y[,2]-v)))]

This adds the matches as a new column:

cbind(x,min=y[,1][sapply(x[,2],\(v)which.min(abs(y[,2]-v)))])

Benchmark:

l1=1e3;l2=2e3
x=data.frame(Gene=paste0(letters,1:l1),Location=round(runif(l1)*1e8))
y=data.frame(Enhancer=paste0(letters,1:l2),Location_enh=round(runif(l2)*1e8))

microbenchmark(times=10,
  sapply=sapply(x[,2],\(v)which.min(abs(v-y[,2]))),
  outer_with_apply=apply(abs(outer(x[,2],y[,2],`-`)),2,which.min),
  outer_with_rowMins=Rfast::rowMins(abs(outer(x[,2],y[,2],`-`))),
  tidyverse=expand_grid(x,y)%>%group_by(Gene)%>%slice_min(abs(Location-Location_enh))%>%ungroup(),
  distance_left_join=fuzzyjoin::distance_left_join(x,y,by=c("Location"="Location_enh"),max_dist=Inf)
)

Unit: milliseconds
               expr         min          lq        mean      median          uq         max neval
             sapply    12.11934    12.44806    21.41222    12.76987    14.15599    97.16636    10
   outer_with_apply    29.96070    30.93023    77.44443    43.87493   104.09461   263.08380    10
 outer_with_rowMins    18.10431    18.45688    63.42812    18.70985    22.11443   355.13233    10
          tidyverse   581.14552   646.77041  1029.70376  1082.38570  1245.14755  1667.03231    10
 distance_left_join 10993.47979 11528.22607 11759.76920 11637.74354 11977.03131 12753.62809    10