Check that at least 2 columns in a matrix have at least 3 values... But they have to be in the same rows

Say I have a matrix like the following:

set.seed(123)
newmat=matrix(rnorm(25),ncol=5)
colnames(newmat)=paste0('mark',1:5)
rownames(newmat)=paste0('id',1:5)
newmat[,2]=NA
newmat[c(2,5),4]=NA
newmat[c(1,4,5),5]=NA
newmat[1,1]=NA
newmat[5,3]=NA

> newmat
          mark1 mark2     mark3      mark4      mark5
id1          NA    NA 1.2240818  1.7869131         NA
id2 -0.23017749    NA 0.3598138         NA -0.2179749
id3  1.55870831    NA 0.4007715 -1.9666172 -1.0260044
id4  0.07050839    NA 0.1106827  0.7013559         NA
id5  0.12928774    NA        NA         NA         NA

I want an easy way to check that at least 2 columns have at least 3 non-NA values each, and that those values sit in the same rows of both columns.

In the case above, the pair of columns 1 and 3 fulfils this (both have values in rows id2, id3 and id4), as does the pair of columns 3 and 4 (rows id1, id3 and id4). The pair of columns 1 and 4 does not, since they share only 2 rows. That gives 3 qualifying columns in total.

How could I do this check in R? I know I'd do something involving colSums(!is.na(newmat)) but not sure about the rest... Thanks!

CodePudding user response：

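Here is a matrix, obtained by applying crossprod to !is.na(newmat), that shows which pairs fulfill your objective. Each off-diagonal entry of the cross-product counts the rows in which both columns are non-NA; the diagonal is set to 0 to exclude self-pairings: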

> `diag<-`(crossprod(!is.na(newmat)), 0) >= 3
      mark1 mark2 mark3 mark4 mark5
mark1 FALSE FALSE  TRUE FALSE FALSE
mark2 FALSE FALSE FALSE FALSE FALSE
mark3  TRUE FALSE FALSE  TRUE FALSE
mark4 FALSE FALSE  TRUE FALSE FALSE
mark5 FALSE FALSE FALSE FALSE FALSE

As we can see, the pairs (mark1, mark3) and (mark3, mark4) are the ones that qualify.
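
If you need a single TRUE/FALSE answer, or the names of the qualifying columns, the logical matrix can be reduced further. The following is a minimal sketch rather than part of the answer above; the names shared, ok, has_pair, keep and pair_names are only illustrative.

```r
# Rebuild the example matrix from the question
set.seed(123)
newmat <- matrix(rnorm(25), ncol = 5,
                 dimnames = list(paste0("id", 1:5), paste0("mark", 1:5)))
newmat[, 2] <- NA
newmat[c(2, 5), 4] <- NA
newmat[c(1, 4, 5), 5] <- NA
newmat[1, 1] <- NA
newmat[5, 3] <- NA

# shared[i, j] = number of rows where columns i and j are both non-NA
shared <- crossprod(!is.na(newmat))
diag(shared) <- 0                 # ignore a column paired with itself
ok <- shared >= 3

# Single yes/no answer: is there at least one qualifying pair?
has_pair <- any(ok)

# Names of the columns that take part in at least one qualifying pair
keep <- colnames(newmat)[colSums(ok) > 0]

# Each qualifying pair listed once (upper triangle only)
idx <- which(ok & upper.tri(ok), arr.ind = TRUE)
pair_names <- cbind(rownames(ok)[idx[, "row"]], colnames(ok)[idx[, "col"]])
```

Restricting to upper.tri(ok) avoids reporting (mark1, mark3) and (mark3, mark1) as two separate pairs, since the matrix is symmetric.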

CodePudding user response：

Here's one way to do it.

First, create a data frame of all the possible column pairings, excluding self-pairings:

pairs <- expand.grid(a = colnames(newmat), b = colnames(newmat))
pairs <- pairs[pairs$a != pairs$b,]

Now, for each row in this data frame, use the entries in columns a and b to extract the relevant columns from newmat. Count the number of rows in which both columns are non-NA, and store that count as a new column in pairs. This can all be done with a single apply call:

pairs$matches <- apply(pairs, 1, function(row) {
  # row[1] and row[2] are the two column names of this pairing
  sum(!is.na(newmat[, row[1]]) & !is.na(newmat[, row[2]]))
})
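
To see what the function inside apply computes, it may help to run it by hand for one pair. The sketch below does this for mark1 and mark3; the complete.cases variant is an equivalent alternative I'm adding for comparison, and the names both_present, n_and and n_cc are only illustrative.

```r
# Rebuild the example matrix from the question
set.seed(123)
newmat <- matrix(rnorm(25), ncol = 5,
                 dimnames = list(paste0("id", 1:5), paste0("mark", 1:5)))
newmat[, 2] <- NA
newmat[c(2, 5), 4] <- NA
newmat[c(1, 4, 5), 5] <- NA
newmat[1, 1] <- NA
newmat[5, 3] <- NA

# TRUE for every row where both columns hold a value
both_present <- !is.na(newmat[, "mark1"]) & !is.na(newmat[, "mark3"])
n_and <- sum(both_present)

# Same count: rows of the two-column sub-matrix with no NA at all
n_cc <- sum(complete.cases(newmat[, c("mark1", "mark3")]))
```

Both counts come out as 3 here, for rows id2, id3 and id4.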

Now filter out the rows of pairs where there were fewer than 3 matches:

pairs <- pairs[pairs$matches > 2,]

Now pairs looks like this:

pairs
#>        a     b matches
#> 3  mark3 mark1       3
#> 11 mark1 mark3       3
#> 14 mark4 mark3       3
#> 18 mark3 mark4       3

If we unlist the first two columns, take the unique values and sort them, we get a vector of the column names we want. We can use it to subset the matrix and drop the columns that belong to no qualifying pair:

newmat[,sort(unique(as.character(unlist(pairs[1:2]))))]
#>           mark1     mark3      mark4
#> id1          NA 1.2240818  1.7869131
#> id2 -0.23017749 0.3598138         NA
#> id3  1.55870831 0.4007715 -1.9666172
#> id4  0.07050839 0.1106827  0.7013559
#> id5  0.12928774        NA         NA
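
If this check is needed more than once, the steps could be wrapped in a small function. This is only a sketch: cols_with_shared_values and its min_shared argument are hypothetical names, and it uses combn instead of expand.grid so that each unordered pair is visited once. It assumes the matrix has column names and at least two columns.

```r
# Hypothetical helper: names of the columns that belong to at least one pair
# sharing `min_shared` or more rows in which both columns are non-NA.
# Assumes `m` has column names and at least two columns.
cols_with_shared_values <- function(m, min_shared = 3) {
  present <- !is.na(m)
  pairs <- combn(ncol(m), 2)   # 2 x n matrix, one unordered pair per column
  n_shared <- apply(pairs, 2, function(p) {
    sum(present[, p[1]] & present[, p[2]])
  })
  good <- pairs[, n_shared >= min_shared, drop = FALSE]
  colnames(m)[sort(unique(as.vector(good)))]
}

# Rebuild the example matrix from the question
set.seed(123)
newmat <- matrix(rnorm(25), ncol = 5,
                 dimnames = list(paste0("id", 1:5), paste0("mark", 1:5)))
newmat[, 2] <- NA
newmat[c(2, 5), 4] <- NA
newmat[c(1, 4, 5), 5] <- NA
newmat[1, 1] <- NA
newmat[5, 3] <- NA

keep <- cols_with_shared_values(newmat)
reduced <- newmat[, keep, drop = FALSE]   # drop = FALSE keeps a matrix
```

With a stricter threshold such as min_shared = 4, no pair qualifies for this matrix and the function returns an empty character vector.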