I have two data frames as below:
df1 <- data.frame(chrom = c(1,1,3,6,6),
chromStart = c(15433, 1959,34205,35043, 77456),
chromEnd = c(15700, 2001,36245,36245,78469),
id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))
df2 <- data.frame(chrom = c(1,1,5,1,6),
chromStart2 = c(15433, 1961,34205,1962, 77456),
chromEnd2 = c(15700, 2002,36245,1999,78480))
I'd like to merge the two data frames by matching chrom == chrom, chromStart = between(chromStart2 -5, chromStart2 5)
and chromEnd = between(chromEnd2 -5, chromEnd2 5)
. What I've tried is:
library(dplyr)
colnames(df2) <- c('chrom','chromStart', 'chromEnd')
merged <- inner_join(df1,df2)
However, that only matches the exact chromStart
and chromEnd
, in our case only aaad
matches. I'd like to give it a range of plus or minus so that dfk
matches as well. My actual dataframes are 260000 rows and 179000 rows, so I would prefer a memory efficient way if possible. Here are the results I'm looking for:
data.frame(chrom = c(1,1,1),
chromStart = c(15433, 1959,1959),
chromEnd = c(15700, 2001,2001),
id = c('aaad', 'dfk', 'dfk'),
chromStart2 = c(15433, 1961,1962),
chromEnd2 = c(15700, 2002,1999))
CodePudding user response:
There may be better/more efficient ways, but these should work.
A dplyr
approach: create two temporary logic vectors based on your conditions, then filter
s based on those meeting both conditions, then drops (select
) the temporary columns:
merged <- inner_join(df1, df2) %>%
mutate(
inStart = chromStart >= chromStart2 - 5 & chromStart <= chromStart2 5,
inEnd = chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd 5) %>%
filter(inStart, inEnd) %>%
select(-inStart, -inEnd)
### or in one `mutate` command:
# merged <- inner_join(df1, df2) %>%
# mutate(inrows = (chromStart >= chromStart2 - 5 & chromStart <= #chromStart2 5) &
# (chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd 5)) %>%
# filter(inrows) %>%
# select(-inrows)
Output:
# chrom chromStart chromEnd id chromStart2 chromEnd2
# 1 1 15433 15700 aaad 15433 15700
# 2 1 1959 2001 dfk 1961 2002
# 3 1 1959 2001 dfk 1962 1999
And check to ensure it meets final desired data exactly:
all.equal(merged,
data.frame(chrom = c(1,1,1),
chromStart = c(15433, 1959,1959),
chromEnd = c(15700, 2001,2001),
id = c('aaad', 'dfk', 'dfk'),
chromStart2 = c(15433, 1961,1962),
chromEnd2 = c(15700, 2002,1999))
)
# [1] TRUE
A base R
approach: subset the data by identifying the rows that meet the same conditions
base1 <- merge(df1, df2, by = "chrom")
base_merged <- base1[(base1$chromStart >= base1$chromStart2 - 5 & base1$chromStart <= base1$chromStart2 5) &
(base1$chromEnd >= base1$chromEnd2 - 5 & base1$chromEnd <= base1$chromEnd 5),]