Please help me optimise this R function (I don't know how to use apply)-CodePudding

I need some help optimising a function. I'm aware that an apply function could speed this up, but I've never learned to use that family of functions properly and can't find advice online that I can follow easily to convert this ...

I have two data frames. They look like this (but each has >1 million rows).

df.DP:
CHR    POS 
12     463
12     5412
12     76123
431    276
431    200187
8521   23
8521   2001

df.mask:
CHR    START    END 
10     67       876
12     4321     8724
12     8742     8910
277    10293    10599
8521   1068     3233

What I want do do is add a column to df.DP which indicates whether each row matching a CHR value from df.mask also has a POS value that is greater than df.mask START and lower than df.mask END. For example...

result:
CHR    POS       mask
12     463       0
12     5412      1
12     76123     0
431    276       0
431    200187    0
8521   23        0
8521   2001      1

This is the function I've written:

index.masked <- function(df.DP, df.mask){
  #Create all 0s masked index column
  df.DP$masked = 0
  #Iterate over df
  for(i in 1:nrow(df.DP)){
    #Report progress
    print(i)
    #Check if df.DP$CHR[i] is in df.mask$CHR 
    if(df.DP$CHR[i] %in% df.mask$CHR){
      #Check if ith SNP is within masked range
      if(nrow(df.mask[which(df.mask$CHR == df.DP$CHR[i] & 
                            df.mask$START < df.DP$POS[i] & 
                            df.mask$END > df.DP$POS[i]),]) > 0){
        #Report progress
        print("Ding!")
        #Set index column to 1
        df.DP$masked[i] <- 1
      }
    }
  }
  #Return df.DP
  return(df.DP)
}

Basically this is horribly slow. As I've said, each data frame has > 1 million rows. On top of that, I have multiple data frames on which I need to perform this operation.

If anyone could please show me how to make this faster I would be very grateful.

Here is some code to generate dummy data...

test.DP <- data.frame(CHR = c("12", "12", "23", "23", "23"), 
                      POS = c(245, 6542, 12, 564, 1874))
test.mask <- data.frame(CHR = c("12", "13", "23"), 
                        START = c(150, 717, 550), 
                        END = c(270, 871, 599))

When I run my function on this dummy data it works fine.

test1 <- index.masked(test.DP, test.mask)

> test1
  CHR  POS masked
1  12  245      1
2  12 6542      0
3  23   12      0
4  23  564      1
5  23 1874      0

With > 1M rows in each data frame, this is too slow though.

I have searched extensively and while I can find plenty of posts here asking for help with similar problems, I'm so unfamiliar with the apply functions and other approaches to solving things that I just can't follow what's been done and apply (no pun intended) it to my own situation.

Any help would be deeply appreciated.

CodePudding user response：

Thank you to @MrFlick - from the linked post I found a solution which I could piece together.

    new.function <- function(df.DP, df.mask){
      #Create ID column to index all rows
      df.DP$ID = 1:nrow(df.DP)
      #Create mask index column
      df.DP$masked = 0
      #Perform a merge to get SNPs in ranges
      in.ranges = subset(merge(df.DP, df.mask), START <= POS & POS <= END)
      #Use ID in in.ranges to change df.DP$masked to 1 for rows in masked ranges
      df.DP$masked[which(df.DP$ID %in% in.ranges$ID)] = 1
      #Return df.DP
      return(df.DP %>% select(-(ID)))
    }
    
    test2 <- new.function(test.DP, test.mask)
    
    test1
      CHR  POS masked
1  12  245      1
2  12 6542      0
3  23   12      0
4  23  564      1
5  23 1874      0


    test2
      CHR  POS masked
1  12  245      1
2  12 6542      0
3  23   12      0
4  23  564      1
5  23 1874      0

The solution proposed by @Andre Wilberg looks like it's using a very similar approach. I will try the solution I've put together, and if it works then I'll mark this as the answer. If not, I will try @Andre's solution.

Thank you for your help, people :)

CodePudding user response：

You can try left_join from dplyr

library(dplyr)

left_join(df.DP, df.mask, "CHR") %>% 
  mutate(masked = (POS >= START & POS <= END) * 1, START = NULL, END = NULL)
  CHR  POS masked
1  12  245      1
2  12 6542      0
3  23   12      0
4  23  564      1
5  23 1874      0

Data

df.DP <- structure(list(CHR = c("12", "12", "23", "23", "23"), POS = c(245,
6542, 12, 564, 1874)), class = "data.frame", row.names = c(NA,
-5L))

df.mask <- structure(list(CHR = c("12", "13", "23"), START = c(150, 717,
550), END = c(270, 871, 599)), class = "data.frame", row.names = c(NA,
-3L))

EDIT: On your original data it might be necessary to remove duplicated rows.

left_join(df.DP, df.mask, "CHR") %>% 
  mutate(masked = (POS >= START & POS <= END) * 1, START = NULL, END = NULL) %>% 
  distinct() %>% group_by(CHR, POS) %>% 
  filter(!(n() > 1 & masked == 0)) %>% 
  ungroup()
# A tibble: 7 × 3
    CHR    POS masked
  <int>  <int>  <dbl>
1    12    463      0
2    12   5412      1
3    12  76123      0
4   431    276     NA
5   431 200187     NA
6  8521     23      0
7  8521   2001      1

ext. data

df.DP <- structure(list(CHR = c(12L, 12L, 12L, 431L, 431L, 8521L, 8521L
), POS = c(463L, 5412L, 76123L, 276L, 200187L, 23L, 2001L)), class = "data.frame", row.names = c(NA,
-7L))

df.mask <- structure(list(CHR = c(10L, 12L, 12L, 277L, 8521L), START = c(67L,
4321L, 8742L, 10293L, 1068L), END = c(876L, 8724L, 8910L, 10599L,
3233L)), class = "data.frame", row.names = c(NA, -5L))

CodePudding user response：

A data.table non-equi join will be fast.

Example 1

library(data.table)

test.DP <- data.frame(CHR = c("12", "12", "23", "23", "23"), 
                      POS = c(245, 6542, 12, 564, 1874))
test.mask <- data.frame(CHR = c("12", "13", "23"), 
                        START = c(150, 717, 550), 
                        END = c(270, 871, 599))

setDT(test.DP)[, mask := 0][setDT(test.mask), mask := 1, on = .(CHR == CHR, POS <= END, POS >= START)][]
#>    CHR  POS mask
#> 1:  12  245    1
#> 2:  12 6542    0
#> 3:  23   12    0
#> 4:  23  564    1
#> 5:  23 1874    0

Example 2

test.DP <- data.frame(CHR = c("12", "12", "12", "431", "431", "8521", "8521"), 
                      POS = c(463, 5412, 76123, 276, 200187, 23, 2001))
test.mask <- data.frame(CHR = c("10", "12", "12", "277", "8521"), 
                        START = c(67, 4321, 8742, 10293, 1068), 
                        END = c(876, 8724, 8910, 10599, 3233))

setDT(test.DP)[, mask := 0][setDT(test.mask), mask := 1, on = .(CHR == CHR, POS <= END, POS >= START)][]
#>     CHR    POS mask
#> 1:   12    463    0
#> 2:   12   5412    1
#> 3:   12  76123    0
#> 4:  431    276    0
#> 5:  431 200187    0
#> 6: 8521     23    0
#> 7: 8521   2001    1