Home > Enterprise >  Select set of columns so that each row has at least one non-NA entry
Select set of columns so that each row has at least one non-NA entry

Time:04-30

I have a large number of variables (columns), but each has missing values for some of the observations (rows). How can I get a set (or all sets) of columns so that each row has at least one non-NA value?

Example:

> df <- data.frame(a=c(1,NA,NA,1,NA), b=c(NA,1,NA,1,NA), c=c(1,NA,NA,NA,1), d=c(1,1,1,1,NA))
> df
   a  b  c  d
1  1 NA  1  1
2 NA  1 NA  1
3 NA NA NA  1
4  1  1 NA  1
5 NA NA  1 NA

Here I would like to obtain columns c and d, as these combined result in at least one non-NA observation for each row. I thought of brute-forcing all possible variable combinations, but it is not feasible in my case given a large number of variables. Is there a more efficient solution?

CodePudding user response:

Using a while loop, this should work to get the minimum set of variables with at least one non-NA per row.

best <- function(df){
  best <- which.max(colSums(sapply(df, complete.cases)))
  while(any(rowSums(sapply(df[best], complete.cases)) == 0)){
    best <- c(best, which.max(sapply(df[is.na(df[best]), ], \(x) sum(complete.cases(x)))))
  }
  best
}

testing

best(df)
#d c 
#4 3

df[best(df)]
#   d  c
#1  1  1
#2  1 NA
#3  1 NA
#4  1 NA
#5 NA  1

First, select the column with the least NAs (stored in best). Then, update the vector with the column that has the highest number of non-NA rows on the remaining rows (where best has still NAs), until you get every rows with a complete case.

  • Related