How do you merge two dataframes in R where you only merge the rows that are within a certain range?-CodePudding

Say that I have these dataframes:

SNPs <- c('SNP1', 'SNP2', 'SNP3', 'SNP4', 'SNP5', 'SNP6')

Basepair <- c('75', '251', '643', '820', '952', '1455')

df <- data.frame(SNPs, Basepair)


BP_start <- c('0', '100', '200', '300', '400', '500', '600', '700', '800', '900', '1000', '1100', '1200', '1300', '1400', '1500', '1600')

BP_end <- c('100', '200', '300', '400', '500', '600', '700', '800', '900', '1000', '1100', '1200', '1300', '1400', '1500', '1600', '1700')

sweep_nr <- c(1:17)
df_sweep <- data.frame(BP_start, BP_end, sweep_nr)

And I want to merge the df_sweep dataframe with the df dataframe, so that there is a new column showing what sweep_nr the SNP is included in.

Then, they'd have to merge based on ranges, since for example SNP1 is between BP_start == 0 and BP_end == 100, but I haven't found any way to do this kind of merge.

CodePudding user response：

With data.table it is quite is easy to do non_equi joins, but you just have to make sure that your Basepair, BP_start and BP_end-columns are numeric first. Then transform your data.frames to data.table and merge them:

SNPs <- c('SNP1', 'SNP2', 'SNP3', 'SNP4', 'SNP5', 'SNP6')
Basepair <- c(75, 251, 643, 820, 952, 1455)
df <- data.frame(SNPs, Basepair)

BP_start <- seq(0, 1600, by = 100)
BP_end <- seq(100, 1700, by = 100)
sweep_nr <- c(1:17)
df_sweep <- data.frame(BP_start, BP_end, sweep_nr)

library(data.table)
setDT(df) # transform to data.table
setDT(df_sweep) # transform to data.table

df_merge <- as.data.frame(df_sweep[df,
                                   .(SNPs, Basepair, sweep_nr),
                                   on = .(BP_start <= Basepair, BP_end >= Basepair)
                                   ])
df_merge
#>   SNPs Basepair sweep_nr
#> 1 SNP1       75        1
#> 2 SNP2      251        3
#> 3 SNP3      643        7
#> 4 SNP4      820        9
#> 5 SNP5      952       10
#> 6 SNP6     1455       15

^{Created on 2022-10-19 by the reprex package (v2.0.1)}

In the end you can transform the result back to a data.frame if you are more comfortable with using them.

CodePudding user response：

Does it need to be merged?

df_sweep <- data.frame(BP_start, BP_end, sweep_nr) %>% map_df(as.numeric)

filter_range <- function(bp, df_sweep){
  df_sweep %>% filter(bp >= BP_start & bp < BP_end) %>% pull(sweep_nr)
}

df <- df %>% rowwise() %>% mutate(new_col = filter_range(as.numeric(Basepair), df_sweep))

CodePudding user response：

Assuming that BP_start actually matches BP_end (so the BP_end of the 1st row is exactly the BP_start of the next), you don't actually need to look at both: you just use BP_start and the lovely one-liner findInterval.

df$sweep <- df_sweep$sweep_nr[findInterval(as.numeric(df$Basepair), as.numeric(df_sweep$BP_start))]

df
  SNPs Basepair sweep
1 SNP1       75     1
2 SNP2      251     3
3 SNP3      643     7
4 SNP4      820     9
5 SNP5      952    10
6 SNP6     1455    15

CodePudding user response：

This is certainly not the best answer, but you already have a tidyverse and a data.table answer, and I spent more time on this than I am willing to admit, so here it goes. I'm not super satisfied with this, so feel free to critique. Note that this is not actually merging the dataframes. Instead, we are just figuring out the sweep_nr that corresponds to the range and value in df.

sweeplist <- list(start = df_sweep$BP_start, end = df_sweep$BP_end)


`%within%` <- function(lhs, rhs){
  lhs <- as.numeric(lhs)

  Map(\(x,y){
    x <- as.numeric(x)
    y <- as.numeric(y)
    lhs %in% {x:y} 
  }, rhs$start, rhs$end) |>
    unlist()|>
    unname()|>
    which()
}


df$sweep_nr <- df_sweep$sweep_nr[sapply(df$Basepair, \(x) x %within% sweeplist)]

df
#>   SNPs Basepair sweep_nr
#> 1 SNP1       75        1
#> 2 SNP2      251        3
#> 3 SNP3      643        7
#> 4 SNP4      820        9
#> 5 SNP5      952       10
#> 6 SNP6     1455       15