I need some help optimising a function. I'm aware that an apply function could speed this up, but I've never learned to use that family of functions properly and can't find advice online that I can follow easily to convert this ...
I have two data frames. They look like this (but each has >1 million rows).
df.DP:
CHR POS
12 463
12 5412
12 76123
431 276
431 200187
8521 23
8521 2001
df.mask:
CHR START END
10 67 876
12 4321 8724
12 8742 8910
277 10293 10599
8521 1068 3233
What I want do do is add a column to df.DP which indicates whether each row matching a CHR value from df.mask also has a POS value that is greater than df.mask START and lower than df.mask END. For example...
result:
CHR POS mask
12 463 0
12 5412 1
12 76123 0
431 276 0
431 200187 0
8521 23 0
8521 2001 1
This is the function I've written:
index.masked <- function(df.DP, df.mask){
#Create all 0s masked index column
df.DP$masked = 0
#Iterate over df
for(i in 1:nrow(df.DP)){
#Report progress
print(i)
#Check if df.DP$CHR[i] is in df.mask$CHR
if(df.DP$CHR[i] %in% df.mask$CHR){
#Check if ith SNP is within masked range
if(nrow(df.mask[which(df.mask$CHR == df.DP$CHR[i] &
df.mask$START < df.DP$POS[i] &
df.mask$END > df.DP$POS[i]),]) > 0){
#Report progress
print("Ding!")
#Set index column to 1
df.DP$masked[i] <- 1
}
}
}
#Return df.DP
return(df.DP)
}
Basically this is horribly slow. As I've said, each data frame has > 1 million rows. On top of that, I have multiple data frames on which I need to perform this operation.
If anyone could please show me how to make this faster I would be very grateful.
Here is some code to generate dummy data...
test.DP <- data.frame(CHR = c("12", "12", "23", "23", "23"),
POS = c(245, 6542, 12, 564, 1874))
test.mask <- data.frame(CHR = c("12", "13", "23"),
START = c(150, 717, 550),
END = c(270, 871, 599))
When I run my function on this dummy data it works fine.
test1 <- index.masked(test.DP, test.mask)
> test1
CHR POS masked
1 12 245 1
2 12 6542 0
3 23 12 0
4 23 564 1
5 23 1874 0
With > 1M rows in each data frame, this is too slow though.
I have searched extensively and while I can find plenty of posts here asking for help with similar problems, I'm so unfamiliar with the apply functions and other approaches to solving things that I just can't follow what's been done and apply (no pun intended) it to my own situation.
Any help would be deeply appreciated.
CodePudding user response:
Thank you to @MrFlick - from the linked post I found a solution which I could piece together.
new.function <- function(df.DP, df.mask){
#Create ID column to index all rows
df.DP$ID = 1:nrow(df.DP)
#Create mask index column
df.DP$masked = 0
#Perform a merge to get SNPs in ranges
in.ranges = subset(merge(df.DP, df.mask), START <= POS & POS <= END)
#Use ID in in.ranges to change df.DP$masked to 1 for rows in masked ranges
df.DP$masked[which(df.DP$ID %in% in.ranges$ID)] = 1
#Return df.DP
return(df.DP %>% select(-(ID)))
}
test2 <- new.function(test.DP, test.mask)
test1
CHR POS masked
1 12 245 1
2 12 6542 0
3 23 12 0
4 23 564 1
5 23 1874 0
test2
CHR POS masked
1 12 245 1
2 12 6542 0
3 23 12 0
4 23 564 1
5 23 1874 0
The solution proposed by @Andre Wilberg looks like it's using a very similar approach. I will try the solution I've put together, and if it works then I'll mark this as the answer. If not, I will try @Andre's solution.
Thank you for your help, people :)
CodePudding user response:
You can try left_join
from dplyr
library(dplyr)
left_join(df.DP, df.mask, "CHR") %>%
mutate(masked = (POS >= START & POS <= END) * 1, START = NULL, END = NULL)
CHR POS masked
1 12 245 1
2 12 6542 0
3 23 12 0
4 23 564 1
5 23 1874 0
Data
df.DP <- structure(list(CHR = c("12", "12", "23", "23", "23"), POS = c(245,
6542, 12, 564, 1874)), class = "data.frame", row.names = c(NA,
-5L))
df.mask <- structure(list(CHR = c("12", "13", "23"), START = c(150, 717,
550), END = c(270, 871, 599)), class = "data.frame", row.names = c(NA,
-3L))
EDIT: On your original data it might be necessary to remove duplicated rows.
left_join(df.DP, df.mask, "CHR") %>%
mutate(masked = (POS >= START & POS <= END) * 1, START = NULL, END = NULL) %>%
distinct() %>% group_by(CHR, POS) %>%
filter(!(n() > 1 & masked == 0)) %>%
ungroup()
# A tibble: 7 × 3
CHR POS masked
<int> <int> <dbl>
1 12 463 0
2 12 5412 1
3 12 76123 0
4 431 276 NA
5 431 200187 NA
6 8521 23 0
7 8521 2001 1
ext. data
df.DP <- structure(list(CHR = c(12L, 12L, 12L, 431L, 431L, 8521L, 8521L
), POS = c(463L, 5412L, 76123L, 276L, 200187L, 23L, 2001L)), class = "data.frame", row.names = c(NA,
-7L))
df.mask <- structure(list(CHR = c(10L, 12L, 12L, 277L, 8521L), START = c(67L,
4321L, 8742L, 10293L, 1068L), END = c(876L, 8724L, 8910L, 10599L,
3233L)), class = "data.frame", row.names = c(NA, -5L))
CodePudding user response:
A data.table
non-equi join will be fast.
Example 1
library(data.table)
test.DP <- data.frame(CHR = c("12", "12", "23", "23", "23"),
POS = c(245, 6542, 12, 564, 1874))
test.mask <- data.frame(CHR = c("12", "13", "23"),
START = c(150, 717, 550),
END = c(270, 871, 599))
setDT(test.DP)[, mask := 0][setDT(test.mask), mask := 1, on = .(CHR == CHR, POS <= END, POS >= START)][]
#> CHR POS mask
#> 1: 12 245 1
#> 2: 12 6542 0
#> 3: 23 12 0
#> 4: 23 564 1
#> 5: 23 1874 0
Example 2
test.DP <- data.frame(CHR = c("12", "12", "12", "431", "431", "8521", "8521"),
POS = c(463, 5412, 76123, 276, 200187, 23, 2001))
test.mask <- data.frame(CHR = c("10", "12", "12", "277", "8521"),
START = c(67, 4321, 8742, 10293, 1068),
END = c(876, 8724, 8910, 10599, 3233))
setDT(test.DP)[, mask := 0][setDT(test.mask), mask := 1, on = .(CHR == CHR, POS <= END, POS >= START)][]
#> CHR POS mask
#> 1: 12 463 0
#> 2: 12 5412 1
#> 3: 12 76123 0
#> 4: 431 276 0
#> 5: 431 200187 0
#> 6: 8521 23 0
#> 7: 8521 2001 1