Prune list of pairs in R dataframe by removing redundancies based on genomic distance and p-value-CodePudding

I have an R dataframe containing pairs of SNPs (with chromosome number and base pair location in separate columns) and p-values for each pair. The data looks like this:

chr_snp1     loc_snp1     chr_snp2     loc_snp2     pval
1            278474050    2            57386477     7.43e-08
1            275620856    2            57386662     1.08e-07
1            144075771    3            109909704    1.02e-06
1            144075771    3            111344453    2.06e-06
2            103701229    7            56369738     3.83e-08
2            102990566    7            56407691     1.07e-07

I want to remove redundancies in pairs. However, redundancy being defined as a similar pair that is close to another determined by the values in the loc_snp1 and loc_snp2 columns as well as chr_snp1 and chr_snp2. The numeric limit I want to impose for values in loc_snp1 and loc_snp2 is 10,000,000 for pairs on the same combination of chromosomes. This data is sorted by chromosome (chr_snp1 and chr_snp2), then base pair location (loc_snp1 and loc_snp2), then by p-value (pval).

Basically I want the script to check if the value in loc_snp1 is within 10,000,000 of the value in loc_snp1 in the row above and if they both have the same value in chr_snp1. Then, also check if the value in loc_snp2 is within 10,000,000 of the value in loc_snp2 in the row above and if they both have the same value in chr_snp2. If all four conditions are true, only keep the row with the lowest pval and discard the other.

In order words, only keep the best pair and remove all the others that are close (have the same chromosome combination and within 10,000,000 base pairs of other pairs on their respective chromosomes).

This would result in the dataframe looking like:

chr_snp1     loc_snp1     chr_snp2     loc_snp2     pval
1            278474050    2            57386477     7.43e-08
1            144075771    3            109909704    1.02e-06
2            103701229    7            56369738     3.83e-08

Of course, the script doesn't have to actually check the row above. I assume there are more elegant ways to approach the problem.

CodePudding user response：

Here is option. If you already know chr_snp1 and chr_snp2 need to be identical, you can start by grouping them and then calculating the the differences. Admittedly, this might fail if you have a duplicate chr_snp1 somewhere else in the dataset.

library(tidyverse)

df |>
  group_by(chr_snp1, chr_snp2) |>
  mutate(grp = (abs(loc_snp1 - lag(loc_snp1, default = first(loc_snp1))) < 10e6) &
           (abs(loc_snp2 - lag(loc_snp2, default = first(loc_snp2))) < 10e6)) |>
  group_by(grp, .add=TRUE)  |>
  filter(pval == min(pval)) |>
  ungroup()|>
  select(-grp)
#> # A tibble: 3 x 5
#>   chr_snp1  loc_snp1 chr_snp2  loc_snp2         pval
#>      <dbl>     <dbl>    <dbl>     <dbl>        <dbl>
#> 1        1 278474050        2  57386477 0.0000000743
#> 2        1 144075771        3 109909704 0.00000102  
#> 3        2 103701229        7  56369738 0.0000000383

CodePudding user response：

df <- textConnection('
chr_snp1, loc_snp1, chr_snp2, loc_snp2, pval
1, 278474050, 2, 57386477, 7.43e-08
1, 275620856, 2, 57386662, 1.08e-07
1, 194406449, 2, 182907442, 2.22e-06
1, 144075771, 3, 109909704, 1.02e-06
1, 144075771, 3, 111344453, 1.02e-06
2, 103701229, 7, 56369738, 1.02e-06
2, 102990566, 7, 56407691, 1.02e-06
') |> read.csv(header = TRUE)

# filter based on first 4 criteria
threshold <- 1e7
idx <- (
  ((diff(df$loc_snp1) |> abs()) <= threshold) &
    (diff(df$chr_snp1) == 0) &
    ((diff(df$loc_snp2) |> abs()) <= threshold) &
    (diff(df$chr_snp2) == 0)
) |> which()
df <- df[ idx, ]

# now retain only the pair with lowest p-value
df <- split(df, paste0(df$chr_snp1, df$chr_snp2)) |>
  lapply(function(x) {
    idx <- which.min(x$pval)
    x[ idx, ]
  }) |> 
  data.table::rbindlist() |> 
  as.data.frame()

# now retain only the pair with lowest p-value (mirror redundancies)
df <- split(df, paste0(df$chr_snp2, df$chr_snp1)) |>
  lapply(function(x) {
    idx <- which.min(x$pval)
    x[ idx, ]
  }) |> 
  data.table::rbindlist() |> 
  as.data.frame()

print(df)

  chr_snp1  loc_snp1 chr_snp2  loc_snp2     pval
1        1 278474050        2  57386477 7.43e-08
3        1 144075771        3 109909704 1.02e-06
5        2 103701229        7  56369738 1.02e-06

CodePudding user response：

A friend was able to code how to handle mirror duplicates after AndS's initial pruning step.

Cart_Inter_Pruned$skip = FALSE
df = NULL
threshold = 10e7
for(i in 1:nrow(Cart_Inter_Pruned)){
  if(Cart_Inter_Pruned[i,]$skip) next
  ordered_key1 = sort(array(unlist(Cart_Inter_Pruned[i,c(1,3)])))
  
  for(j in (i 1):nrow(Cart_Inter_Pruned)){
    if(Cart_Inter_Pruned[j,]$skip) next
    ordered_key2 = sort(array(unlist(Cart_Inter_Pruned[j,c(1,3)])))
    if(paste(ordered_key1,collapse=",") == paste(ordered_key2,collapse = ",")){
      if_switch = ordered_key1[1] == ordered_key2[2]
      print(paste(i,j,ordered_key1))
      if(if_switch){
        if(abs(Cart_Inter_Pruned[i,2]- Cart_Inter_Pruned[j,4]) <= threshold && abs(Cart_Inter_Pruned[i,4] - Cart_Inter_Pruned[j,2]) <= threshold){
          Cart_Inter_Pruned[j,]$skip = TRUE
          Cart_Inter_Pruned[i,5] = min(as.numeric(Cart_Inter_Pruned[i,5] ,as.numeric(Cart_Inter_Pruned[j,5])))
        }
      }else{
        if(abs(Cart_Inter_Pruned[i,2]- Cart_Inter_Pruned[j,2]) <= threshold && abs(Cart_Inter_Pruned[i,4] - Cart_Inter_Pruned[j,4]) <= threshold){
          Cart_Inter_Pruned[j,]$skip = TRUE
          Cart_Inter_Pruned[i,5] = min(as.numeric(Cart_Inter_Pruned[i,5] ,as.numeric(Cart_Inter_Pruned[j,5])))
        }
      }
    }
  }
}