I have a very large dataframe of genomic loci with genotypes scored as 0, 1, or 2. Here is a very small sample that I think gets at the issue:
x1 x2 x3 x4
0 0 1 0
0 0 1 0
1 1 2 1
1 1 1 1
2 2 0 1
2 2 1 2
Loci x1 and x2 are identical while x4 is highly similar. What I am hoping to achieve is to create a function, or use one that already exists, to assign similarity scores, row-wise, for each of my loci and then prune the dataset based on a threshold similarity that I set.
For example, if I set the threshold at 1 (100%), it would prune only x1 and x2 as they are duplicates - which I know how to do. However, if I set the threshold at 0.8, or 80% similarity, it would also prune x4 in addition to x1 and x2.
It's important that the function acts on row-wise similarity and doesn't just compare that columns have similar distributions of 0's, 1's, and 2's.
CodePudding user response:
Here's how I would approach this.
First, get a listing of all the unique pairings of your column names:
pairs <- expand.grid(names(df), names(df), stringsAsFactors = FALSE)
pairs <- pairs[lower.tri(diag(length(df))), ]
pairs
#> Var1 Var2
#> 2 x2 x1
#> 3 x3 x1
#> 4 x4 x1
#> 7 x3 x2
#> 8 x4 x2
#> 12 x4 x3
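As an aside, the same listing of unique pairs can be generated with combn(), which enumerates the combinations directly and skips the lower.tri() trick. A sketch (the within-pair ordering is reversed relative to the version above, which doesn't matter since similarity is symmetric):

```r
# Each row is one unordered pair of column names
pairs <- as.data.frame(t(combn(names(df), 2)), stringsAsFactors = FALSE)
names(pairs) <- c("Var1", "Var2")
```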
Now iterate through this listing to compute the proportion of rows that are identical in each unique pair of columns of your original data set. This gives you a similarity score between 0 and 1 for each column pair:
pairs$similarity <- apply(pairs, 1, function(x) sum(df[x[1]] == df[x[2]])/nrow(df))
pairs
#> Var1 Var2 similarity
#> 2 x2 x1 1.0000000
#> 3 x3 x1 0.1666667
#> 4 x4 x1 0.8333333
#> 7 x3 x2 0.1666667
#> 8 x4 x2 0.8333333
#> 12 x4 x3 0.1666667
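If you'd rather see every pairwise score at once, the same proportion-identical measure can be computed as a full similarity matrix. A minimal sketch using nested sapply() (the name sim_mat is mine, not from the code above):

```r
# Proportion of rows on which each pair of columns agrees;
# the diagonal is trivially 1 (every column matches itself)
sim_mat <- sapply(df, function(a) sapply(df, function(b) mean(a == b)))
```

The lower triangle of sim_mat holds the same six scores as the pairs listing above.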
Now drop every row of this listing whose similarity score falls below your chosen threshold (we'll use 0.8 here). Note the `>=`: with a strict `>`, a threshold of 1 would never match the exact duplicates x1 and x2, since their similarity is exactly 1.
pairs <- pairs[pairs$similarity >= 0.8, ]
pairs
#> Var1 Var2 similarity
#> 2 x2 x1 1.0000000
#> 4 x4 x1 0.8333333
#> 8 x4 x2 0.8333333
Now we extract all the unique column names in Var1 and Var2, since these are the columns that are similar to at least one other column:
keep_cols <- as.character(sort(unique(c(pairs$Var1, pairs$Var2))))
keep_cols
#> [1] "x1" "x2" "x4"
And we subset our original data frame using this to get our desired result:
df[keep_cols]
#> x1 x2 x4
#> 1 0 0 0
#> 2 0 0 0
#> 3 1 1 1
#> 4 1 1 1
#> 5 2 2 1
#> 6 2 2 2
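As a side note, if you already work in the tidyverse, the same subsetting step can be written with dplyr (this assumes the dplyr package is installed; all_of() raises a helpful error if any name in keep_cols is missing):

```r
library(dplyr)

# Equivalent subset by column name
df %>% select(all_of(keep_cols))
```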
Of course, you could wrap all of this in a function to make it easy to adjust your threshold:
remove_dissimilar <- function(df, threshold = 0.8) {
  pairs <- expand.grid(names(df), names(df), stringsAsFactors = FALSE)
  pairs <- pairs[lower.tri(diag(length(df))), ]
  pairs$similarity <- apply(pairs, 1, function(x) {
    sum(df[x[1]] == df[x[2]]) / nrow(df)
  })
  pairs <- pairs[pairs$similarity >= threshold, ]
  keep_cols <- sort(unique(c(pairs$Var1, pairs$Var2)))
  df[keep_cols]
}
So now you could just do:
remove_dissimilar(df, 0.8)
#> x1 x2 x4
#> 1 0 0 0
#> 2 0 0 0
#> 3 1 1 1
#> 4 1 1 1
#> 5 2 2 1
#> 6 2 2 2
remove_dissimilar(df, 0.9)
#> x1 x2
#> 1 0 0
#> 2 0 0
#> 3 1 1
#> 4 1 1
#> 5 2 2
#> 6 2 2
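Finally, since the question mentions a very large dataframe, the per-pair apply() loop can get slow as the number of loci grows (the pair count is quadratic in the column count). Here's a sketch of a more vectorized variant under the same proportion-identical measure; the function name keep_similar and its internals are my own, not part of the answer above:

```r
keep_similar <- function(df, threshold = 0.8) {
  m <- as.matrix(df)
  n <- ncol(m)
  sim <- matrix(0, n, n, dimnames = list(colnames(m), colnames(m)))
  for (i in seq_len(n)) {
    # Proportion of rows on which every column agrees with column i
    sim[, i] <- colMeans(m == m[, i])
  }
  diag(sim) <- 0  # ignore each column's trivial match with itself
  df[, apply(sim, 2, max) >= threshold, drop = FALSE]
}

keep_similar(df, 0.8)  # keeps x1, x2, x4 on the sample data
```

This does one colMeans() pass per column instead of one data-frame comparison per pair, which tends to scale better on wide genotype matrices.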