I have two dataframes, x and y, both of which represent items and their respective locations (which are represented as integers). Dataframe x responds to Genes and their locations; dataframe y responds to Enhancers and their locations. For every Gene in dataframe x, I want to find the Enhancer in y whose location is closest. Here are the first five rows of both dataframes:
Gene: | Location:
----------------------------------
CORT | 10450031
LOC107985174 | 110639954
LOC105369199 | 120963648
CD1D | 158178030
EPHA2 | 16124337
Enhancer: | Location:
-----------------------------------------------------
genic|NC_000001.11:180541-181713 | 180541
genic|NC_000001.11:819650-823755 | 819650
genic|NC_000001.11:1290023-1294341 | 1290023
genic|NC_000001.11:2072541-2076498 | 2072541
genic|NC_000001.11:2132164-2134268 | 2132164
I have been using which.min()
like so: Enhancers[which.min(abs(x-Enhancers$location)),]
where x
corresponds to the given gene's location, which does appear to work, but it requires manually inputing the location of each individual gene. I am wondering if there is a way to accomplish this for all the genes at once. Any help would be appreciated, thank you.
CodePudding user response:
You could expand a grid with both data sets, group by Gene and select the row with the minimum absolute difference in location for each Gene.
library(tidyr)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(2105)
x <- data.frame(Gene = letters[1:5], Location = 1:5)
y <- data.frame(Enhancer = letters[6:10], Location_enh = 5*runif(5))
x
#> Gene Location
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
y
#> Enhancer Location_enh
#> 1 f 1.2275958
#> 2 g 2.2874741
#> 3 h 4.2954764
#> 4 i 4.2017862
#> 5 j 0.9555975
expand_grid(x, y) %>%
group_by(Gene) %>%
slice_min(abs(Location - Location_enh)) %>%
ungroup()
#> # A tibble: 5 x 4
#> Gene Location Enhancer Location_enh
#> <chr> <int> <chr> <dbl>
#> 1 a 1 j 0.956
#> 2 b 2 g 2.29
#> 3 c 3 g 2.29
#> 4 d 4 i 4.20
#> 5 e 5 h 4.30
Created on 2022-08-15 by the reprex package (v2.0.1)
CodePudding user response:
Looks like genetic data. If you're already using Bioconductor packages, then Biobase::matchpt()
will do the trick quickly and efficiently (and in multiple dimensions).
CodePudding user response:
This data doesn't show enough similarity to be interesting, but I think this method will work for what you need:
apply(abs(outer(df1$Location, df2$Location, `-`)), 1, which.min)
# [1] 5 5 5 5 5
This is saying that df1$Location
s are all closest to the 5th value of df2$Location
.
Data
df1 <- structure(list(Gene = c("CORT", "LOC107985174", "LOC105369199", "CD1D", "EPHA2"), Location = c(10450031, 110639954, 120963648, 158178030, 16124337)), row.names = c(NA, -5L), class = "data.frame")
df2 <- structure(list(Enhancer = c("NC_000001.11:180541-181713", "NC_000001.11:819650-823755", "NC_000001.11:1290023-1294341", "NC_000001.11:2072541-2076498", "NC_000001.11:2132164-2134268"), Location = c(180541, 819650, 1290023, 2072541, 2132164)), row.names = c(NA, -5L), class = "data.frame")
CodePudding user response:
You can use distance_join
set of functions from fuzzyjoin
R package to join 2 tables by distance. You can provide max_dist
argument to show NA if there is no data within the specified range.
For this demo I've used the test data from @seb09's answer.
library(tidyverse)
library(fuzzyjoin)
set.seed(2105)
x <- data.frame(Gene = letters[1:5], Location = 1:5)
y <- data.frame(Enhancer = letters[6:10], Location_enh = 5*runif(5))
distance_left_join(x, y, by = c("Location" = "Location_enh"), max_dist = 1)
#> Gene Location Enhancer Location_enh
#> 1 a 1 f 1.2275958
#> 2 a 1 j 0.9555975
#> 3 b 2 f 1.2275958
#> 4 b 2 g 2.2874741
#> 5 c 3 g 2.2874741
#> 6 d 4 h 4.2954764
#> 7 d 4 i 4.2017862
#> 8 e 5 h 4.2954764
#> 9 e 5 i 4.2017862
Created on 2022-08-15 by the reprex package (v2.0.0)
If you need to select a single enhancer per gene with closest distance:
distance_left_join(x, y, by = c("Location" = "Location_enh"), max_dist = 1) %>%
group_by(Gene) %>%
slice_min(abs(Location - Location_enh))
CodePudding user response:
This just returns a vector of the matches:
l1=1e3;l2=2e3
x=data.frame(gene=paste0(letters,1:l1),Location=round(runif(l1)*1e8))
y=data.frame(enhancer=paste0(letters,1:l2),Location_enh=round(runif(l2)*1e8))
y[,1][sapply(x[,2],\(v)which.min(abs(y[,2]-v)))]
This adds the matches as a new column:
cbind(x,min=y[,1][sapply(x[,2],\(v)which.min(abs(y[,2]-v)))])
Benchmark:
l1=1e3;l2=2e3
x=data.frame(Gene=paste0(letters,1:l1),Location=round(runif(l1)*1e8))
y=data.frame(Enhancer=paste0(letters,1:l2),Location_enh=round(runif(l2)*1e8))
microbenchmark(times=10,
sapply=sapply(x[,2],\(v)which.min(abs(v-y[,2]))),
outer_with_apply=apply(abs(outer(x[,2],y[,2],`-`)),2,which.min),
outer_with_rowMins=Rfast::rowMins(abs(outer(x[,2],y[,2],`-`))),
tidyverse=expand_grid(x,y)%>%group_by(Gene)%>%slice_min(abs(Location-Location_enh))%>%ungroup(),
distance_left_join=fuzzyjoin::distance_left_join(x,y,by=c("Location"="Location_enh"),max_dist=Inf)
)
Unit: milliseconds
expr min lq mean median uq max neval
sapply 12.11934 12.44806 21.41222 12.76987 14.15599 97.16636 10
outer_with_apply 29.96070 30.93023 77.44443 43.87493 104.09461 263.08380 10
outer_with_rowMins 18.10431 18.45688 63.42812 18.70985 22.11443 355.13233 10
tidyverse 581.14552 646.77041 1029.70376 1082.38570 1245.14755 1667.03231 10
distance_left_join 10993.47979 11528.22607 11759.76920 11637.74354 11977.03131 12753.62809 10