I have two dataframes of different length. The first looks like this and is the dataframe I want to add the True/False column to:
chr_snp loc_snp ma_snp
1 184319928 T
1 276998062 A
1 278255864 G
2 243012470 G
2 123072103 T
3 526785124 A
The second data frame is the reference dataframe that is smaller:
chr_QTL loc_QTL ma_QTL
1 281788173 G
1 203085725 C
2 241577141 C
For each row in dataframe 1 (df1
), I want to first check if the value of df1$chr_snp
matches a value in df2$chr_QTL. If this match is true, then I want to determine if the value in df1$loc_snp
is within 10 million units (these are DNA base-pairs) above OR below any values based on the first condition in df2$loc_QTL
. Now, what is tricky is that for the first three rows of df1
, there are three possible row matches in df2
(rows 1 and 2) based on the first criteria alone. However, only two match based on the second criteria (10M base-pairs greater than OR less than value in df2$loc_QTL
). Note: df1$ma_snp
and df2$ma_QTL
can be totally ignored. So, based on these criteria, df1
should now look like:
chr_snp loc_snp ma_snp Match
1 184319928 T FALSE
1 276998062 A TRUE
1 278255864 G TRUE
2 243012470 G TRUE
2 123072103 T FALSE
3 526785124 A FALSE
CodePudding user response:
Here is one option. Join the data then look for any instance of < 10,000,000.
library(tidyverse)
left_join(df1 |>
mutate(rw_id = row_number()),
df2, by = c("chr_snp" = "chr_QTL")) |>
mutate(less = abs(loc_snp -loc_QTL) < 10e6) |>
group_by(rw_id)|>
summarise(across(contains(colnames(df1)), ~.[[1]]),
Match = any(less),
Match = ifelse(is.na(Match), FALSE, Match))
#> # A tibble: 6 x 5
#> rw_id chr_snp loc_snp ma_snp Match
#> <int> <dbl> <dbl> <chr> <lgl>
#> 1 1 1 184319928 T FALSE
#> 2 2 1 276998062 A TRUE
#> 3 3 1 278255864 G TRUE
#> 4 4 2 243012470 G TRUE
#> 5 5 2 123072103 T FALSE
#> 6 6 3 526785124 A FALSE
or another option:
library(tidyverse)
df1 |>
mutate(Match = map2_lgl(chr_snp, loc_snp,
\(x, y){
(x %in% df2$chr_QTL) &&
any(abs(df2[df2$chr_QTL == x, 'loc_QTL']-y) < 10e6)
}))
#> # A tibble: 6 x 4
#> chr_snp loc_snp ma_snp Match
#> <dbl> <dbl> <chr> <lgl>
#> 1 1 184319928 T FALSE
#> 2 1 276998062 A TRUE
#> 3 1 278255864 G TRUE
#> 4 2 243012470 G TRUE
#> 5 2 123072103 T FALSE
#> 6 3 526785124 A FALSE
just for fun. Here is a third option with base:
df1$Match <- apply(outer(df1$loc_snp, df2$loc_QTL, \(x,y) abs(x-y) < 10e6) &
outer(df1$chr_snp, df2$chr_QTL, `==`), 1, any)
df1
#> # A tibble: 6 x 4
#> chr_snp loc_snp ma_snp Match
#> <dbl> <dbl> <chr> <lgl>
#> 1 1 184319928 T FALSE
#> 2 1 276998062 A TRUE
#> 3 1 278255864 G TRUE
#> 4 2 243012470 G TRUE
#> 5 2 123072103 T FALSE
#> 6 3 526785124 A FALSE
CodePudding user response:
One way:
split(df1, 1:NROW(df1)) <- lapply(split(df1, 1:NROW(df1)), function(x)
x$Match <- any(df2$chr_QTL==x$chr_snp & abs(df2$loc_QTL - x$loc_snp) < 1e7))