I am trying to drop the columns that have fewer than 5 valid (non-NA) values. Here is an example dataset:
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 i1 = c(0,1,1,1,1,0,0,1,NA,1),
                 i2 = c(1,0,0,1,0,1,1,0,0,NA),
                 i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
                 i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))
> df
   id i1 i2 i3 i4
1   1  0  1 NA NA
2   2  1  0 NA  1
3   3  1  0 NA NA
4   4  1  1 NA NA
5   5  1  0 NA NA
6   6  0  1 NA NA
7   7  0  1 NA NA
8   8  1  0 NA NA
9   9 NA  0 NA  1
10 10  1 NA  0 NA
In this case, columns i3 and i4 need to be dropped from the data frame.
How can I get the desired dataset below:
> df
   id i1 i2
1   1  0  1
2   2  1  0
3   3  1  0
4   4  1  1
5   5  1  0
6   6  0  1
7   7  0  1
8   8  1  0
9   9 NA  0
10 10  1 NA
CodePudding user response:
You can keep columns with at least 5 non-missing values with:
df[colSums(!is.na(df)) >= 5]
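To see why this works: colSums(!is.na(df)) counts the non-missing values in each column, and indexing a data frame with a single logical vector selects columns. For the example data:

> colSums(!is.na(df))
id i1 i2 i3 i4 
10  9  9  1  2 

so only id, i1, and i2 pass the >= 5 test.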
CodePudding user response:
You can use discard() from the purrr package:
library(purrr)

df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 i1 = c(0,1,1,1,1,0,0,1,NA,1),
                 i2 = c(1,0,0,1,0,1,1,0,0,NA),
                 i3 = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,0),
                 i4 = c(NA,1,NA,NA,NA,NA,NA,NA,1,NA))

df %>%
  discard(~ sum(!is.na(.)) < 5)
#>    id i1 i2
#> 1   1  0  1
#> 2   2  1  0
#> 3   3  1  0
#> 4   4  1  1
#> 5   5  1  0
#> 6   6  0  1
#> 7   7  0  1
#> 8   8  1  0
#> 9   9 NA  0
#> 10 10  1 NA
Created on 2022-11-10 with reprex v2.0.2
While this is likely slower than the base R methods (noticeable only for datasets with very many columns, e.g. more than 1000), I find the readability of the code far superior. It also makes more complicated conditions easy to express.
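If the positive phrasing is clearer, keep() from the same package is the complement of discard(), so this sketch should give the same result:

df %>%
  keep(~ sum(!is.na(.)) >= 5)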
CodePudding user response:
Using base R, another approach:
> df[, sapply(df, function(x) sum(!is.na(x))) >= 5]
   id i1 i2
1   1  0  1
2   2  1  0
3   3  1  0
4   4  1  1
5   5  1  0
6   6  0  1
7   7  0  1
8   8  1  0
9   9 NA  0
10 10  1 NA
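A closely related sketch: vapply() works like sapply() here but declares the expected return type, which is safer in programmatic use:

df[vapply(df, function(x) sum(!is.na(x)), integer(1)) >= 5]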
CodePudding user response:
A performance comparison of the different answers given in this post:
library(dplyr)
library(purrr)

ncol = 10000
nrow = 100
df = replicate(ncol, sample(c(1:9, NA), nrow, TRUE)) %>% as_tibble()

funs = list(
  colSums = function(df) {df[colSums(!is.na(df)) >= nrow/10]},
  sapply  = function(df) {df[, sapply(df, function(x) sum(!is.na(x))) >= nrow/10]},
  discard = function(df) {df %>% discard(~ sum(!is.na(.)) < nrow/10)},
  mutate  = function(df) {df %>% mutate(across(where(~ sum(!is.na(.)) < nrow/10), ~ NULL))},
  select  = function(df) {df %>% select(where(~ sum(!is.na(.)) >= nrow/10))})

avrtime = map_dbl(funs, function(f) {
  duration = c()
  for (i in 1:10) {
    t1 = Sys.time()
    f(df)
    t2 = Sys.time()
    duration[i] = as.numeric(t2 - t1)
  }
  return(mean(duration))
})

avrtime[order(avrtime)]
The average time taken by each (in seconds):
colSums sapply discard select mutate
0.04510500 0.04692972 0.29207475 0.29451160 0.31755514
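As an aside, a more robust sketch could use the bench package (an assumption; it was not part of the original comparison), which repeats each expression and accounts for garbage collection:

library(bench)

bench::mark(
  colSums = df[colSums(!is.na(df)) >= nrow/10],
  sapply  = df[, sapply(df, function(x) sum(!is.na(x))) >= nrow/10],
  check = FALSE  # compare timings only, not the returned objects
)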
CodePudding user response:
Using select()
library(dplyr)

df %>%
  select(where(~ sum(complete.cases(.x)) >= 5))
-output
   id i1 i2
1   1  0  1
2   2  1  0
3   3  1  0
4   4  1  1
5   5  1  0
6   6  0  1
7   7  0  1
8   8  1  0
9   9 NA  0
10 10  1 NA
Or in base R
Filter(\(x) sum(complete.cases(x)) >= 5, df)
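Note that the \(x) lambda shorthand requires R 4.1 or later; on older versions, write out the anonymous function:

Filter(function(x) sum(complete.cases(x)) >= 5, df)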