I have an R dataframe containing pairs of SNPs (with chromosome number and base pair location in separate columns) and p-values for each pair. The data looks like this:
chr_snp1 loc_snp1 chr_snp2 loc_snp2 pval
1 278474050 2 57386477 7.43e-08
1 275620856 2 57386662 1.08e-07
1 144075771 3 109909704 1.02e-06
1 144075771 3 111344453 2.06e-06
2 103701229 7 56369738 3.83e-08
2 102990566 7 56407691 1.07e-07
I want to remove redundancies in pairs. However, redundancy being defined as a similar pair that is close to another determined by the values in the loc_snp1
and loc_snp2
columns as well as chr_snp1
and chr_snp2
. The numeric limit I want to impose for values in loc_snp1
and loc_snp2
is 10,000,000 for pairs on the same combination of chromosomes. This data is sorted by chromosome (chr_snp1
and chr_snp2
), then base pair location (loc_snp1
and loc_snp2
), then by p-value (pval
).
Basically I want the script to check if the value in loc_snp1
is within 10,000,000 of the value in loc_snp1
in the row above and if they both have the same value in chr_snp1
. Then, also check if the value in loc_snp2
is within 10,000,000 of the value in loc_snp2
in the row above and if they both have the same value in chr_snp2
. If all four conditions are true, only keep the row with the lowest pval and discard the other.
In order words, only keep the best pair and remove all the others that are close (have the same chromosome combination and within 10,000,000 base pairs of other pairs on their respective chromosomes).
This would result in the dataframe looking like:
chr_snp1 loc_snp1 chr_snp2 loc_snp2 pval
1 278474050 2 57386477 7.43e-08
1 144075771 3 109909704 1.02e-06
2 103701229 7 56369738 3.83e-08
Of course, the script doesn't have to actually check the row above. I assume there are more elegant ways to approach the problem.
CodePudding user response:
Here is option. If you already know chr_snp1 and chr_snp2 need to be identical, you can start by grouping them and then calculating the the differences. Admittedly, this might fail if you have a duplicate chr_snp1 somewhere else in the dataset.
library(tidyverse)
df |>
group_by(chr_snp1, chr_snp2) |>
mutate(grp = (abs(loc_snp1 - lag(loc_snp1, default = first(loc_snp1))) < 10e6) &
(abs(loc_snp2 - lag(loc_snp2, default = first(loc_snp2))) < 10e6)) |>
group_by(grp, .add=TRUE) |>
filter(pval == min(pval)) |>
ungroup()|>
select(-grp)
#> # A tibble: 3 x 5
#> chr_snp1 loc_snp1 chr_snp2 loc_snp2 pval
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 278474050 2 57386477 0.0000000743
#> 2 1 144075771 3 109909704 0.00000102
#> 3 2 103701229 7 56369738 0.0000000383
CodePudding user response:
df <- textConnection('
chr_snp1, loc_snp1, chr_snp2, loc_snp2, pval
1, 278474050, 2, 57386477, 7.43e-08
1, 275620856, 2, 57386662, 1.08e-07
1, 194406449, 2, 182907442, 2.22e-06
1, 144075771, 3, 109909704, 1.02e-06
1, 144075771, 3, 111344453, 1.02e-06
2, 103701229, 7, 56369738, 1.02e-06
2, 102990566, 7, 56407691, 1.02e-06
') |> read.csv(header = TRUE)
# filter based on first 4 criteria
threshold <- 1e7
idx <- (
((diff(df$loc_snp1) |> abs()) <= threshold) &
(diff(df$chr_snp1) == 0) &
((diff(df$loc_snp2) |> abs()) <= threshold) &
(diff(df$chr_snp2) == 0)
) |> which()
df <- df[ idx, ]
# now retain only the pair with lowest p-value
df <- split(df, paste0(df$chr_snp1, df$chr_snp2)) |>
lapply(function(x) {
idx <- which.min(x$pval)
x[ idx, ]
}) |>
data.table::rbindlist() |>
as.data.frame()
# now retain only the pair with lowest p-value (mirror redundancies)
df <- split(df, paste0(df$chr_snp2, df$chr_snp1)) |>
lapply(function(x) {
idx <- which.min(x$pval)
x[ idx, ]
}) |>
data.table::rbindlist() |>
as.data.frame()
print(df)
chr_snp1 loc_snp1 chr_snp2 loc_snp2 pval
1 1 278474050 2 57386477 7.43e-08
3 1 144075771 3 109909704 1.02e-06
5 2 103701229 7 56369738 1.02e-06
CodePudding user response:
A friend was able to code how to handle mirror duplicates after AndS's initial pruning step.
Cart_Inter_Pruned$skip = FALSE
df = NULL
threshold = 10e7
for(i in 1:nrow(Cart_Inter_Pruned)){
if(Cart_Inter_Pruned[i,]$skip) next
ordered_key1 = sort(array(unlist(Cart_Inter_Pruned[i,c(1,3)])))
for(j in (i 1):nrow(Cart_Inter_Pruned)){
if(Cart_Inter_Pruned[j,]$skip) next
ordered_key2 = sort(array(unlist(Cart_Inter_Pruned[j,c(1,3)])))
if(paste(ordered_key1,collapse=",") == paste(ordered_key2,collapse = ",")){
if_switch = ordered_key1[1] == ordered_key2[2]
print(paste(i,j,ordered_key1))
if(if_switch){
if(abs(Cart_Inter_Pruned[i,2]- Cart_Inter_Pruned[j,4]) <= threshold && abs(Cart_Inter_Pruned[i,4] - Cart_Inter_Pruned[j,2]) <= threshold){
Cart_Inter_Pruned[j,]$skip = TRUE
Cart_Inter_Pruned[i,5] = min(as.numeric(Cart_Inter_Pruned[i,5] ,as.numeric(Cart_Inter_Pruned[j,5])))
}
}else{
if(abs(Cart_Inter_Pruned[i,2]- Cart_Inter_Pruned[j,2]) <= threshold && abs(Cart_Inter_Pruned[i,4] - Cart_Inter_Pruned[j,4]) <= threshold){
Cart_Inter_Pruned[j,]$skip = TRUE
Cart_Inter_Pruned[i,5] = min(as.numeric(Cart_Inter_Pruned[i,5] ,as.numeric(Cart_Inter_Pruned[j,5])))
}
}
}
}
}