I am faced with a problem I cannot quite figure out and I hope that someone might be able to help me with this. Its probably super easy but I am struggling and I spent too much time fiddeling around with this that I can't think clearly anymore.
I have a matrix with let's say 20 rows and 5 columns. Sometimes I have NAs in column 1. Other times in columns 1 and 2, or only 2, or 2 and 4 etc.
I want to get the row names of the observations where only the value in column 1 is missing. Then I want to look for combinations such as those where only columns 1 and 2 are missing.
Here is some example code:
set.seed(39)
df <- matrix(rnorm(100),nrow=20, ncol = 5)
df <- apply (df, 2, function(x) {x[sample(c(1:20), floor(20/8))] <- NA; x} )
auxmatrix <- is.na(df)
So far my approach has been this:
which(ifelse(auxmatrix[,1] == T, 1, 0) == 1)
This intends to look for observations where only the value in column 1 has been missing. However, this obviously gives me also those observations where column 1 AND for (example) column 2 is missing.
How can I formulate the statement that it disregards those rows where 1 and 2 are missing?
After that I also want to look for combinations such as:
which(ifelse(auxmatrix[,3] == T & auxmatrix[,4] == T, 1, 0) == 1)
to find observations where there are NAs in columns 3 and 4. Of course, his then also gives me wrong observations because there might be observations where 3,4, and 5 are missing.
I hope I could express my problem properly. I am very grateful for your ideas.
Best regards, tony13
CodePudding user response:
I think it is simpler with which
only:
For row numbers in column 1:
which(is.na(df[,1])
For combinations of columns:
which(is.na(df[,2]) & is.na(df[,3]))
This would return row numbers in which is NA
in columns 2 and 3 for example.
CodePudding user response:
Using rowSums
. Let's use a simpler data frame.
df
# V1 V2 V3 V4 V5
# 1 1 1 1 1 1
# 2 1 1 1 NA 1
# 3 NA 1 1 1 1
# 4 1 NA 1 1 1
# 5 NA 1 1 1 1
# 6 NA NA 1 1 1
# 7 NA NA 1 NA 1
# 8 NA NA 1 1 1
# 9 NA NA NA NA NA
Now we use rowSums
of the boolean matrix that is.na
yields.
## NA only in 1st and 2nd row
which(rowSums(is.na(df[, 1:2])) == 2 & !rowSums(is.na(df[, -(1:2)])))
# [1] 6 8
Using just one column (i.e. first column) is actually a special case,
## NA only in 1st row
which(rowSums(is.na(df[, 1, drop=FALSE])) & !rowSums(is.na(df[, -1])))
# [1] 3 5
which simplifies to:
## NA only in 1st row
which(is.na(df[, 1]) & !rowSums(is.na(df[, -1])))
# [1] 3 5
Data:
df <- structure(list(V1 = c(1L, 1L, NA, 1L, NA, NA, NA, NA, NA), V2 = c(1L,
1L, 1L, NA, 1L, NA, NA, NA, NA), V3 = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA), V4 = c(1L, NA, 1L, 1L, 1L, 1L, NA, 1L, NA), V5 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, NA)), class = "data.frame", row.names = c(NA,
-9L))