Delete specific column if all values are NA in R-CodePudding

I've got a database query running from R which can have multiple columns filled with NA, some of these columns I need further down the line but one specific column can be dropped if all values are NA.

I usually use purrr::discard(~all(is.na(.))) to drop the columns which are all NA, but since this dataframe can contain multiple columns with NA where I only want to remove one I'm struggling to make this specific to the column in a tidyverse solution.

I've currently got this workaround:

  if(sum(is.na(Orders$Originator)) == nrow(Orders)) {
    
    Orders <- Orders %>%
      select(-Originator)
    
  }

But it would improve the readability if I can have this in a tidyverse solution. Hope someone can be of help!

Thanks!

CodePudding user response：

The canonical tidyverse way to address this problem would be to make use of a predicate function used within select(where(...)) and combine this with selection by variable name.

First we could write a custom predicate function to use in where which selects only columns that contain only NAs.

# custom predicate function
all_na <- function(x) {
  all(is.na(x))
}

We can use this function together with a boolean expression saying we don't want to select y if (read AND &) it is all_na:

library(dplyr)

df <- data.frame(
  x = c(1,2,NA),
  y = NA,
  z = c(3,4,5)
)

df %>% 
  select(!(y & where(all_na)))
#>    x z
#> 1  1 3
#> 2  2 4
#> 3 NA 5

To check whether this is really working let's redefine y so that it contains not only NAs and we will see that this time it's not deselected:

df2 <- data.frame(
  x = c(1,2,NA),
  y = c(1,2,NA),
  z = c(3,4,5)
)

df2 %>% 
  select(!(y & where(all_na)))
#>    x  y z
#> 1  1  1 3
#> 2  2  2 4
#> 3 NA NA 5

Instead of a custom function we can use a lambda function inside where:

df %>% 
  select(!(y & where(~ all(is.na(.x)))))

^{Created on 2021-12-07 by the reprex package (v0.3.0)}

In the larger tidyverse we could also use purrr::lmap_at and select y with the .at argument and then create a lambda function saying if all(is.na(.x)) then use an empty list() (= drop the column) otherwise keep the column .x:

library(purrr)
library(dplyr)

df %>% 
  lmap_at("y", ~ if(all(is.na(.x))) list() else .x)
#> # A tibble: 3 x 2
#>       x     z
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2     4
#> 3    NA     5

^{Created on 2021-12-07 by the reprex package (v2.0.1)}

CodePudding user response：

Using example data:

df <- data.frame(
  x = c(1,2,NA),
  y = NA,
  z = c(3,4,5)
)

Here column y is the target column to check if all is.na. Your if and else will be contained in curly braces. The braces will suppress the pipe from using the first argument in a function. Note the else will keep your data frame in the pipe if the condition is false.

library(tidyverse)

df %>%
  { if (all(is.na(.$y))) select(., -y) else . }

Output

CodePudding user response：

It seems you are trying to mix selection by name (i.e. only specific column) and by logical. Both can be done very well separately in tidyverse (use tidy selectors or where), but I am not sure how to combine them!

So here is a dirty solution that doesn't use either:

library(dplyr, warn.conflicts = FALSE)
df <- data.frame(
  x = c(1,2,NA),
  y = NA,
  y2 = NA,
  z = c(3,4,5)
)

df %>% 
  select(-which(colnames(.)=="y" & sapply(., \(x) all(is.na(x)))))
#>    x y2 z
#> 1  1 NA 3
#> 2  2 NA 4
#> 3 NA NA 5

^{Created on 2021-12-07 by the reprex package (v2.0.1)}