Say that I have these dataframes:
SNPs <- c('SNP1', 'SNP2', 'SNP3', 'SNP4', 'SNP5', 'SNP6')
Basepair <- c('75', '251', '643', '820', '952', '1455')
df <- data.frame(SNPs, Basepair)
BP_start <- c('0', '100', '200', '300', '400', '500', '600', '700', '800', '900', '1000', '1100', '1200', '1300', '1400', '1500', '1600')
BP_end <- c('100', '200', '300', '400', '500', '600', '700', '800', '900', '1000', '1100', '1200', '1300', '1400', '1500', '1600', '1700')
sweep_nr <- c(1:17)
df_sweep <- data.frame(BP_start, BP_end, sweep_nr)
And I want to merge the df_sweep dataframe with the df dataframe, so that there is a new column showing what sweep_nr the SNP is included in.
Then, they'd have to merge based on ranges, since for example SNP1 is between BP_start == 0 and BP_end == 100, but I haven't found any way to do this kind of merge.
CodePudding user response:
With data.table
it is quite is easy to do non_equi joins, but you just have to make sure that your Basepair
, BP_start
and BP_end
-columns are numeric first. Then transform your data.frames to data.table and merge them:
SNPs <- c('SNP1', 'SNP2', 'SNP3', 'SNP4', 'SNP5', 'SNP6')
Basepair <- c(75, 251, 643, 820, 952, 1455)
df <- data.frame(SNPs, Basepair)
BP_start <- seq(0, 1600, by = 100)
BP_end <- seq(100, 1700, by = 100)
sweep_nr <- c(1:17)
df_sweep <- data.frame(BP_start, BP_end, sweep_nr)
library(data.table)
setDT(df) # transform to data.table
setDT(df_sweep) # transform to data.table
df_merge <- as.data.frame(df_sweep[df,
.(SNPs, Basepair, sweep_nr),
on = .(BP_start <= Basepair, BP_end >= Basepair)
])
df_merge
#> SNPs Basepair sweep_nr
#> 1 SNP1 75 1
#> 2 SNP2 251 3
#> 3 SNP3 643 7
#> 4 SNP4 820 9
#> 5 SNP5 952 10
#> 6 SNP6 1455 15
Created on 2022-10-19 by the reprex package (v2.0.1)
In the end you can transform the result back to a data.frame if you are more comfortable with using them.
CodePudding user response:
Does it need to be merged?
df_sweep <- data.frame(BP_start, BP_end, sweep_nr) %>% map_df(as.numeric)
filter_range <- function(bp, df_sweep){
df_sweep %>% filter(bp >= BP_start & bp < BP_end) %>% pull(sweep_nr)
}
df <- df %>% rowwise() %>% mutate(new_col = filter_range(as.numeric(Basepair), df_sweep))
CodePudding user response:
Assuming that BP_start
actually matches BP_end
(so the BP_end
of the 1st row is exactly the BP_start
of the next), you don't actually need to look at both: you just use BP_start
and the lovely one-liner findInterval
.
df$sweep <- df_sweep$sweep_nr[findInterval(as.numeric(df$Basepair), as.numeric(df_sweep$BP_start))]
df
SNPs Basepair sweep
1 SNP1 75 1
2 SNP2 251 3
3 SNP3 643 7
4 SNP4 820 9
5 SNP5 952 10
6 SNP6 1455 15
CodePudding user response:
This is certainly not the best answer, but you already have a tidyverse
and a data.table
answer, and I spent more time on this than I am willing to admit, so here it goes. I'm not super satisfied with this, so feel free to critique. Note that this is not actually merging the dataframes. Instead, we are just figuring out the sweep_nr
that corresponds to the range and value in df
.
sweeplist <- list(start = df_sweep$BP_start, end = df_sweep$BP_end)
`%within%` <- function(lhs, rhs){
lhs <- as.numeric(lhs)
Map(\(x,y){
x <- as.numeric(x)
y <- as.numeric(y)
lhs %in% {x:y}
}, rhs$start, rhs$end) |>
unlist()|>
unname()|>
which()
}
df$sweep_nr <- df_sweep$sweep_nr[sapply(df$Basepair, \(x) x %within% sweeplist)]
df
#> SNPs Basepair sweep_nr
#> 1 SNP1 75 1
#> 2 SNP2 251 3
#> 3 SNP3 643 7
#> 4 SNP4 820 9
#> 5 SNP5 952 10
#> 6 SNP6 1455 15