Finding combination of NAs in matrix (exclusive AND?)-CodePudding

I am faced with a problem I cannot quite figure out and I hope that someone might be able to help me with this. Its probably super easy but I am struggling and I spent too much time fiddeling around with this that I can't think clearly anymore.

I have a matrix with let's say 20 rows and 5 columns. Sometimes I have NAs in column 1. Other times in columns 1 and 2, or only 2, or 2 and 4 etc.

I want to get the row names of the observations where only the value in column 1 is missing. Then I want to look for combinations such as those where only columns 1 and 2 are missing.

Here is some example code:

set.seed(39)

df <- matrix(rnorm(100),nrow=20, ncol = 5)
df <- apply (df, 2, function(x) {x[sample(c(1:20), floor(20/8))] <- NA; x} )

auxmatrix <- is.na(df)

So far my approach has been this:

which(ifelse(auxmatrix[,1] == T, 1, 0) == 1)

This intends to look for observations where only the value in column 1 has been missing. However, this obviously gives me also those observations where column 1 AND for (example) column 2 is missing.

How can I formulate the statement that it disregards those rows where 1 and 2 are missing?

After that I also want to look for combinations such as:

which(ifelse(auxmatrix[,3] == T & auxmatrix[,4] == T, 1, 0) == 1)

to find observations where there are NAs in columns 3 and 4. Of course, his then also gives me wrong observations because there might be observations where 3,4, and 5 are missing.

I hope I could express my problem properly. I am very grateful for your ideas.

Best regards, tony13

CodePudding user response：

I think it is simpler with which only:

For row numbers in column 1: which(is.na(df[,1])

For combinations of columns: which(is.na(df[,2]) & is.na(df[,3])) This would return row numbers in which is NA in columns 2 and 3 for example.

CodePudding user response：

Using rowSums. Let's use a simpler data frame.

df
#   V1 V2 V3 V4 V5
# 1  1  1  1  1  1
# 2  1  1  1 NA  1
# 3 NA  1  1  1  1
# 4  1 NA  1  1  1
# 5 NA  1  1  1  1
# 6 NA NA  1  1  1
# 7 NA NA  1 NA  1
# 8 NA NA  1  1  1
# 9 NA NA NA NA NA

Now we use rowSums of the boolean matrix that is.na yields.

## NA only in 1st and 2nd row
which(rowSums(is.na(df[, 1:2])) == 2 & !rowSums(is.na(df[, -(1:2)])))
# [1] 6 8

Using just one column (i.e. first column) is actually a special case,

## NA only in 1st row
which(rowSums(is.na(df[, 1, drop=FALSE])) & !rowSums(is.na(df[, -1])))
# [1] 3 5

which simplifies to:

## NA only in 1st row
which(is.na(df[, 1]) & !rowSums(is.na(df[, -1])))
# [1] 3 5

Data:

df <- structure(list(V1 = c(1L, 1L, NA, 1L, NA, NA, NA, NA, NA), V2 = c(1L, 
1L, 1L, NA, 1L, NA, NA, NA, NA), V3 = c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, NA), V4 = c(1L, NA, 1L, 1L, 1L, 1L, NA, 1L, NA), V5 = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, NA)), class = "data.frame", row.names = c(NA, 
-9L))