Say I have a matrix like the following:
set.seed(123)
newmat=matrix(rnorm(25),ncol=5)
colnames(newmat)=paste0('mark',1:5)
rownames(newmat)=paste0('id',1:5)
newmat[,2]=NA
newmat[c(2,5),4]=NA
newmat[c(1,4,5),5]=NA
newmat[1,1]=NA
newmat[5,3]=NA
> newmat
          mark1 mark2     mark3      mark4      mark5
id1          NA    NA 1.2240818  1.7869131         NA
id2 -0.23017749    NA 0.3598138         NA -0.2179749
id3  1.55870831    NA 0.4007715 -1.9666172 -1.0260044
id4  0.07050839    NA 0.1106827  0.7013559         NA
id5  0.12928774    NA        NA         NA         NA
The only thing I want to check, in an easy way, is whether there are at least 2 columns with at least 3 values each, where those values are in the same rows.
In the case above, the pair of columns 1 and 3 fulfills this, as does the pair of columns 3 and 4. The pair of columns 1 and 4 does not, because those columns share only 2 non-NA rows. That gives a total of 3 qualifying columns.
How could I do this check in R? I know it would involve something like colSums(!is.na(newmat)), but I'm not sure about the rest. Thanks!
CodePudding user response:
Here is a matrix (obtained using crossprod and is.na) that shows which pairs fulfill your objective:
> `diag<-`(crossprod(!is.na(newmat)), 0) >= 3
      mark1 mark2 mark3 mark4 mark5
mark1 FALSE FALSE  TRUE FALSE FALSE
mark2 FALSE FALSE FALSE FALSE FALSE
mark3  TRUE FALSE FALSE  TRUE FALSE
mark4 FALSE FALSE  TRUE FALSE FALSE
mark5 FALSE FALSE FALSE FALSE FALSE
As we can see, the pairs (mark1, mark3) and (mark3, mark4) are the desired output.
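If you want the qualifying pairs as a list rather than a logical matrix, something like the following might work. It is only a sketch: the names shared, hits, good_pairs and has_pair are made up for illustration, and it rebuilds the matrix from the question so it can be run on its own. Entry [i, j] of the crossprod result is the number of rows where columns i and j are both non-NA, so looking above the diagonal gives each unordered pair once.

```r
# Rebuild the example matrix from the question.
set.seed(123)
newmat <- matrix(rnorm(25), ncol = 5,
                 dimnames = list(paste0("id", 1:5), paste0("mark", 1:5)))
newmat[, 2] <- NA
newmat[c(2, 5), 4] <- NA
newmat[c(1, 4, 5), 5] <- NA
newmat[1, 1] <- NA
newmat[5, 3] <- NA

# shared[i, j] counts the rows where columns i and j are both non-NA.
shared <- crossprod(!is.na(newmat))

# Look only above the diagonal so each unordered pair is reported once
# and self-pairings are ignored.
hits <- which(shared >= 3 & upper.tri(shared), arr.ind = TRUE)

# Hypothetical summary table: one row per qualifying pair.
good_pairs <- data.frame(a = colnames(newmat)[hits[, "row"]],
                         b = colnames(newmat)[hits[, "col"]],
                         shared = shared[hits],
                         stringsAsFactors = FALSE)

# Single TRUE/FALSE answer: is there at least one qualifying pair?
has_pair <- nrow(good_pairs) > 0
```

Here has_pair gives the yes/no check asked for, and good_pairs names the pairs behind it.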
CodePudding user response:
Here's one way to do it.
First, create a data frame of all the possible column pairings, excluding self-pairings:
pairs <- expand.grid(a = colnames(newmat), b = colnames(newmat))
pairs <- pairs[pairs$a != pairs$b,]
Now, for each row in this data frame, use the entries in columns a and b to extract the relevant columns from newmat. Count the number of rows in which both columns are non-NA, and store it as a column in pairs. This can all be done with an apply call:
pairs$matches <- apply(pairs, 1, function(row) {
  # row[1] and row[2] are the names of the two columns being compared
  sum(!is.na(newmat[, row[1]]) & !is.na(newmat[, row[2]]))
})
Now filter out the rows of pairs that have fewer than 3 matches:
pairs <- pairs[pairs$matches > 2,]
Now pairs looks like this:
pairs
#>        a     b matches
#> 3  mark3 mark1       3
#> 11 mark1 mark3       3
#> 14 mark4 mark3       3
#> 18 mark3 mark4       3
Each qualifying pair appears twice, once in each order. If we unlist the first two columns, find all the unique values and sort them, we get a vector of the column names we want. We can use this to subset the matrix and drop the redundant columns:
newmat[,sort(unique(as.character(unlist(pairs[1:2]))))]
#>           mark1     mark3      mark4
#> id1          NA 1.2240818  1.7869131
#> id2 -0.23017749 0.3598138         NA
#> id3  1.55870831 0.4007715 -1.9666172
#> id4  0.07050839 0.1106827  0.7013559
#> id5  0.12928774        NA         NA
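A possible variation on the approach above, swapping expand.grid for combn so that each pair is visited only once and no self-pairings need to be removed. This is a sketch, not a drop-in replacement: col_pairs, n_shared, keep and keep_cols are illustrative names, and the matrix is rebuilt from the question so the snippet runs on its own.

```r
# Rebuild the example matrix from the question.
set.seed(123)
newmat <- matrix(rnorm(25), ncol = 5,
                 dimnames = list(paste0("id", 1:5), paste0("mark", 1:5)))
newmat[, 2] <- NA
newmat[c(2, 5), 4] <- NA
newmat[c(1, 4, 5), 5] <- NA
newmat[1, 1] <- NA
newmat[5, 3] <- NA

# combn() lists each unordered pair of column names exactly once,
# one pair per column of the resulting 2-row character matrix.
col_pairs <- combn(colnames(newmat), 2)

# For each pair, count the rows that are non-NA in both columns.
n_shared <- apply(col_pairs, 2, function(p) sum(complete.cases(newmat[, p])))

# Keep the pairs with at least 3 shared rows.
keep <- col_pairs[, n_shared >= 3, drop = FALSE]

# Columns taking part in any qualifying pair, in their original order.
keep_cols <- intersect(colnames(newmat), keep)
newmat[, keep_cols, drop = FALSE]
```

Using drop = FALSE keeps the results as matrices even when only one pair or one column qualifies.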