Prune a dataframe based on row-wise column similarity in R

I have a very large dataframe of genomic loci with genotypes scored as 0, 1, or 2. Here is a very small sample that I think gets at the issue:

x1  x2  x3  x4
0   0   1   0
0   0   1   0
1   1   2   1
1   1   1   1
2   2   0   1
2   2   1   2

Loci x1 and x2 are identical, while x4 is highly similar to both. What I am hoping to achieve is to create a function, or use one that already exists, that assigns a similarity score, row-wise, to each pair of loci and then prunes the dataset based on a similarity threshold that I set.

For example, if I set the threshold at 1 (100%), it would prune only x1 and x2 as they are duplicates - which I know how to do. However, if I set the threshold at 0.8, or 80% similarity, it would also prune x4 in addition to x1 and x2.

It's important that the function acts on row-wise similarity and doesn't just check whether columns have similar overall distributions of 0s, 1s, and 2s.

CodePudding user response:

Here's how I would approach this.
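
For a reproducible example, I'll assume your sample data is in a data frame called df (the same values as the table in your question):

df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)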

First, get a listing of all the unique pairings of your column names:

pairs <- expand.grid(names(df), names(df))
# keep only the lower triangle of the name grid: each unordered pair once, no self-pairs
pairs <- pairs[lower.tri(replicate(length(df), names(df))), ]

pairs
#>    Var1 Var2
#> 2    x2   x1
#> 3    x3   x1
#> 4    x4   x1
#> 7    x3   x2
#> 8    x4   x2
#> 12   x4   x3
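
(As an aside, the same set of unordered pairs could equivalently be built with combn(); the rest of this answer sticks with the expand.grid version above.)

# alternative pairing, if you prefer: one row per unordered pair, columns named V1/V2
pairs_alt <- as.data.frame(t(combn(names(df), 2)))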

Now iterate through this listing, computing the proportion of rows that are identical for each unique pair of columns in your original data set. This gives you a similarity score between 0 and 1 for each column pair:

# proportion of rows in which the two columns agree
pairs$similarity <- apply(pairs, 1, function(x) sum(df[x[1]] == df[x[2]]) / nrow(df))

pairs
#>    Var1 Var2 similarity
#> 2    x2   x1  1.0000000
#> 3    x3   x1  0.1666667
#> 4    x4   x1  0.8333333
#> 7    x3   x2  0.1666667
#> 8    x4   x2  0.8333333
#> 12   x4   x3  0.1666667
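
As a quick sanity check on a single pair: x1 and x4 agree in 5 of 6 rows, and the same proportion falls out if you compute it directly:

mean(df$x1 == df$x4)
#> [1] 0.8333333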

Now remove all the rows of this listing that have a similarity score below your chosen threshold (we'll use 0.8 here). Note the >=, so that scores exactly equal to the threshold are kept; this also makes a threshold of 1 behave as expected for perfect duplicates:

pairs <- pairs[which(pairs$similarity >= 0.8), ]

pairs
#>   Var1 Var2 similarity
#> 2   x2   x1  1.0000000
#> 4   x4   x1  0.8333333
#> 8   x4   x2  0.8333333

Now we extract all the unique column names in Var1 and Var2, since these are the columns that are similar to at least one other column:

keep_cols <- as.character(sort(unique(c(pairs$Var1, pairs$Var2))))

keep_cols
#> [1] "x1" "x2" "x4"

And we subset our original data frame using this to get our desired result:

df[match(keep_cols, names(df))]
#>   x1 x2 x4
#> 1  0  0  0
#> 2  0  0  0
#> 3  1  1  1
#> 4  1  1  1
#> 5  2  2  1
#> 6  2  2  2
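
Since keep_cols already holds column names, an equivalent and slightly shorter subset is:

df[keep_cols]   # same result, subsetting by name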

Of course, you could put all this in a function to make it easier to adjust your threshold and apply it repeatedly:

remove_dissimilar <- function(df, threshold = 0.8) {
  
  # all unique unordered pairs of column names
  pairs <- expand.grid(names(df), names(df))
  pairs <- pairs[lower.tri(replicate(length(df), names(df))), ]
  
  # proportion of rows in which each pair of columns agrees
  pairs$similarity <- apply(pairs, 1, function(x) {
    sum(df[x[1]] == df[x[2]]) / nrow(df)
  })
  
  # keep only columns that are at least `threshold` similar to some other column
  pairs <- pairs[which(pairs$similarity >= threshold), ]
  keep_cols <- as.character(sort(unique(c(pairs$Var1, pairs$Var2))))
  df[match(keep_cols, names(df))]
}

So now you could just do:

remove_dissimilar(df, 0.8)
#>   x1 x2 x4
#> 1  0  0  0
#> 2  0  0  0
#> 3  1  1  1
#> 4  1  1  1
#> 5  2  2  1
#> 6  2  2  2

remove_dissimilar(df, 0.9)
#>   x1 x2
#> 1  0  0
#> 2  0  0
#> 3  1  1
#> 4  1  1
#> 5  2  2
#> 6  2  2
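
One last aside, since you mention the real data frame is very large: the same scores can also be computed as a full pairwise similarity matrix in one step, which can be easier to inspect or to prune from. This is just a sketch of an alternative, not part of the approach above:

# proportion of identical rows for every pair of loci (symmetric matrix)
sim <- sapply(df, function(a) sapply(df, function(b) mean(a == b)))

round(sim, 2)
#>      x1   x2   x3   x4
#> x1 1.00 1.00 0.17 0.83
#> x2 1.00 1.00 0.17 0.83
#> x3 0.17 0.17 1.00 0.17
#> x4 0.83 0.83 0.17 1.00

# keep columns similar to at least one *other* column (> 1 skips the self-match)
keep <- colnames(sim)[colSums(sim >= 0.8) > 1]
keep
#> [1] "x1" "x2" "x4"

df[keep] then gives the same pruned data frame as remove_dissimilar(df, 0.8).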