I have two data frames of different lengths, like : df1
locusnum CHR MinBP MaxBP
1: 1 1 13982248 14126651
2: 2 1 21538708 21560253
3: 3 1 28892760 28992798
4: 4 1 43760070 43927877
5: 5 1 149999059 150971195
6: 6 1 200299701 200441048
df2
position chr
27751 13982716 1
27750 13982728 1
10256 13984208 1
27729 13985591 1
27730 13988076 1
27731 13988403 1
both dfs
has other columns. df2
has 60000
rows and df1
has 64
rows.
I want to populate a new column in df2
with locusnum
from df1
. The condition would be df2$chr == df1$CHR & df2$position %in% df1$MinBP:df1$MaxBP
My expected output would be
position chr locusnum
27751 13982716 1 1
27750 13982728 1 1
10256 13984208 1 1
27729 13985591 1 1
27730 13988076 1 1
27731 13988403 1 1
So far I have tried with ifelse
statement and for loop as below:
if (df2$chr == df1$CHR & df2$position >= df1$MinBP & df2$position <= df1$MaxBP) df2$locusnum=df1$locusnum
and
for(i in 1:length(df2$position)){ #runs the following code for each line
if(df2$chr[i] == df1$CHR & df2$position[i] %in% df1$MinBP:df1$MaxBP){ #if logical TRUE then it runs the next line
df2$locusnum[i] <- df1$locusnum #gives value of another column to a new column
but got error:
the condition has length > 1
longer object length is not a multiple of shorter object length
Any help? Did I explain the issue clearly? } }
CodePudding user response:
Using foverlaps(...)
from the data.table
package.
Your example is uninteresting because all the rows correspond to locusnum = 1
, so I changed df2
a little bit to demonstrate how this works.
##
# df1 is as you provided it
# in df2: note changes to position column in row 2, 3, and 6
#
df2 <- read.table(text="
id position chr
27751 13982716 1
27750 21538718 1
10256 43760080 1
27729 13985591 1
27730 13988076 1
27731 200299711 1", header=TRUE)
##
# you start here
#
library(data.table)
setDT(df1)
setDT(df2)
df2[, c('indx', 'start', 'end'):=.(seq(.N), position, position)]
setkey(df1, CHR, MinBP, MaxBP)
setkey(df2, chr, start, end)
result <- foverlaps(df2, df1)[order(indx), .(id, position, chr, locusnum)]
## id position chr locusnum
## 1: 27751 13982716 1 1
## 2: 27750 21538718 1 2
## 3: 10256 43760080 1 4
## 4: 27729 13985591 1 1
## 5: 27730 13988076 1 1
## 6: 27731 200299711 1 6
foverlaps(...)
works best if both data.table
s are keyed, but this changes the order of the rows in df2
, so I added an index
column to recover the original ordering, then removed it at the end.
This should be extremely fast but 60,000 rows is a tiny data-set tbh so you might not notice a difference.
CodePudding user response:
Given data frames are not too large, consider merge
on chr/Chr fields with a subset
for range and return only locusnum column. Then, merge
against original data frame to return full set by row.names
:
merged_df <- merge(
df2, df1, by.x="chr", by.y="CHR"
) |> subset(
subset = (position >= MinBP & position <= MaxBP),
select = locusnum
) |> merge(
x=df2, y=_, by="row.names"
)