if values of a column is in between two columns in R, populate a new column-CodePudding

I have two data frames of different lengths, like : df1

       locusnum CHR     MinBP     MaxBP
1:        1   1  13982248  14126651
2:        2   1  21538708  21560253
3:        3   1  28892760  28992798
4:        4   1  43760070  43927877
5:        5   1 149999059 150971195
6:        6   1 200299701 200441048

df2

position chr
27751 13982716   1
27750 13982728   1
10256 13984208   1
27729 13985591   1
27730 13988076   1
27731 13988403   1

both dfs has other columns. df2 has 60000 rows and df1 has 64 rows.

I want to populate a new column in df2 with locusnum from df1. The condition would be df2$chr == df1$CHR & df2$position %in% df1$MinBP:df1$MaxBP

My expected output would be

          position chr locusnum
27751 13982716   1  1
27750 13982728   1  1
10256 13984208   1  1
27729 13985591   1  1
27730 13988076   1  1
27731 13988403   1  1

So far I have tried with ifelse statement and for loop as below:

if (df2$chr == df1$CHR & df2$position >= df1$MinBP & df2$position <= df1$MaxBP) df2$locusnum=df1$locusnum

and

 for(i in 1:length(df2$position)){        #runs the following code for each line
      if(df2$chr[i] == df1$CHR & df2$position[i] %in% df1$MinBP:df1$MaxBP){              #if logical TRUE then it runs the next line
        df2$locusnum[i] <- df1$locusnum    #gives value of another column to a new column

but got error:

the condition has length > 1

longer object length is not a multiple of shorter object length

Any help? Did I explain the issue clearly? } }

CodePudding user response：

Using foverlaps(...) from the data.table package.

Your example is uninteresting because all the rows correspond to locusnum = 1, so I changed df2 a little bit to demonstrate how this works.

##
#  df1 is as you provided it
#  in df2: note changes to position column in row 2, 3, and 6
#
df2 <- read.table(text="
id    position  chr
27751 13982716    1
27750 21538718    1
10256 43760080    1
27729 13985591    1
27730 13988076    1
27731 200299711   1", header=TRUE)
##
#   you start here
#
library(data.table)
setDT(df1)
setDT(df2)
df2[, c('indx', 'start', 'end'):=.(seq(.N), position, position)]
setkey(df1, CHR, MinBP, MaxBP)
setkey(df2, chr, start, end)
result <- foverlaps(df2, df1)[order(indx), .(id, position, chr, locusnum)]
##       id  position chr locusnum
## 1: 27751  13982716   1        1
## 2: 27750  21538718   1        2
## 3: 10256  43760080   1        4
## 4: 27729  13985591   1        1
## 5: 27730  13988076   1        1
## 6: 27731 200299711   1        6

foverlaps(...) works best if both data.tables are keyed, but this changes the order of the rows in df2, so I added an index column to recover the original ordering, then removed it at the end.

This should be extremely fast but 60,000 rows is a tiny data-set tbh so you might not notice a difference.

CodePudding user response：

Given data frames are not too large, consider merge on chr/Chr fields with a subset for range and return only locusnum column. Then, merge against original data frame to return full set by row.names:

merged_df <- merge(
    df2, df1, by.x="chr", by.y="CHR"
) |> subset(
    subset = (position >= MinBP & position <= MaxBP),
    select = locusnum
) |> merge(
    x=df2, y=_, by="row.names"
)