Match one dataframe based on a range in another dataframe in R tidyverse-CodePudding

I have two large datasets that want to match with each other

library(tidyverse)

df1 <- tibble(position=c(10,11,200,250,300))
df1
#> # A tibble: 5 × 1
#>   position
#>      <dbl>
#> 1       10
#> 2       11
#> 3      200
#> 4      250
#> 5      300

df2 <- tibble(start=c(1,10,200,251),
              end=c(20,100,250,350),
              name=c("geneA","geneB","geneC","geneD"))
df2
#> # A tibble: 4 × 3
#>   start   end name 
#>   <dbl> <dbl> <chr>
#> 1     1    20 geneA
#> 2    10   100 geneB
#> 3   200   250 geneC
#> 4   251   350 geneD

^{Created on 2022-03-03 by the reprex package (v2.0.1)}

I have the position of the genes in the df1 and I want to find based on the range (start-end) from the df2 how many genes can be found in this position.

I want my data to look like this

  position start   end name 
     <dbl> <dbl> <dbl> <chr>
1       10     1    20 geneA
2       10    10   100 geneB
3       11     1    20 geneA
4       11    10   100 geneB
5      200   200   250 geneC
6      250   200   250 geneC
7      300   251   350 geneD

One way to solve this could be through crossing and filtering

df1 %>% 
  crossing(df2) %>% 
  filter(position >= start & position <= end)

However my dataset is way too large and can not afford crossing or expanding. Any other idea?

CodePudding user response：

SQL engines can perform such operations without crossing. (It may be possible to speed it up even more if you add indexes.)

library(sqldf)

sqldf("select *
  from df1 a
  join df2 b on a.position between b.start and b.end")

CodePudding user response：

crossing is a wrapper around expand_grid and does additional stuff e.g. filtering. You can use it directly:

library(tidyverse)

df1 <- tibble(position = c(10, 11, 200, 250, 300))
df1
#> # A tibble: 5 × 1
#>   position
#>      <dbl>
#> 1       10
#> 2       11
#> 3      200
#> 4      250
#> 5      300


df2 <- tibble(
  start = c(1, 10, 200, 251),
  end = c(20, 100, 250, 350),
  name = c("geneA", "geneB", "geneC", "geneD")
)

expand_grid(df1, df2) %>%
  filter(position >= start & position <= end)
#> # A tibble: 7 × 4
#>   position start   end name 
#>      <dbl> <dbl> <dbl> <chr>
#> 1       10     1    20 geneA
#> 2       10    10   100 geneB
#> 3       11     1    20 geneA
#> 4       11    10   100 geneB
#> 5      200   200   250 geneC
#> 6      250   200   250 geneC
#> 7      300   251   350 geneD

^{Created on 2022-03-03 by the reprex package (v2.0.0)}

CodePudding user response：

Here is a dplyr way (sort of).

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df1 <- tibble(position = c(10, 11, 200, 250, 300))

df2 <- tibble(
  start = c(1, 10, 200, 251),
  end = c(20, 100, 250, 350),
  name = c("geneA", "geneB", "geneC", "geneD")
)

vbetween <- function(data, col, data2, start, end){
  f <- function(x, l, r) l <= x & x <= r
  col <- enquo(col)
  start <- enquo(start)
  end <- enquo(end)
  x <- data %>% pull(!!col)
  l <- data2 %>% pull(!!start)
  r <- data2 %>% pull(!!end)
  yes <- lapply(x, f, l = l, r = r)
  lapply(yes, \(i) data2[i, ])
}

df1 %>% vbetween(position, df2, start, end) %>% bind_rows()
#> # A tibble: 7 x 3
#>   start   end name 
#>   <dbl> <dbl> <chr>
#> 1     1    20 geneA
#> 2    10   100 geneB
#> 3     1    20 geneA
#> 4    10   100 geneB
#> 5   200   250 geneC
#> 6   200   250 geneC
#> 7   251   350 geneD

^{Created on 2022-03-03 by the reprex package (v2.0.1)}