I have a large number of variables (columns), but each has missing values for some of the observations (rows). How can I get a set (or all sets) of columns so that each row has at least one non-NA value?
Example:
> df <- data.frame(a=c(1,NA,NA,1,NA), b=c(NA,1,NA,1,NA), c=c(1,NA,NA,NA,1), d=c(1,1,1,1,NA))
> df
a b c d
1 1 NA 1 1
2 NA 1 NA 1
3 NA NA NA 1
4 1 1 NA 1
5 NA NA 1 NA
Here I would like to obtain columns c
and d
, as these combined result in at least one non-NA observation for each row. I thought of brute-forcing all possible variable combinations, but it is not feasible in my case given a large number of variables. Is there a more efficient solution?
CodePudding user response:
Using a while
loop, this should work to get the minimum set of variables with at least one non-NA per row.
best <- function(df){
best <- which.max(colSums(sapply(df, complete.cases)))
while(any(rowSums(sapply(df[best], complete.cases)) == 0)){
best <- c(best, which.max(sapply(df[is.na(df[best]), ], \(x) sum(complete.cases(x)))))
}
best
}
testing
best(df)
#d c
#4 3
df[best(df)]
# d c
#1 1 1
#2 1 NA
#3 1 NA
#4 1 NA
#5 NA 1
First, select the column with the least NAs (stored in best
). Then, update the vector with the column that has the highest number of non-NA rows on the remaining rows (where best has still NAs), until you get every rows with a complete case.