Home > Net >  Remove columns that are all NA for at least one level of a factor
Remove columns that are all NA for at least one level of a factor

Time:12-04

I am hoping to tidy a dataframe by removing variables that are empty for any level of a grouping factor. It is fairly easy to remove columns that are entirely empty, however there appears to be no simple way to apply this selection over groups.

## Data

site<-c("A","A","A","A","A","B","B","B","B","B")
year<-c("2000","2001","2002","2003","2004","2000","2001","2002","2003","2004")
species_A<-c(1,2,3,4,5,NA,NA,NA,NA,NA)
species_B<-c(1,2,NA,4,5,NA,3,4,5,6)
species_C<-c(1,2,3,4,5,2,3,4,5,6)

dat<-data.frame(site,year,species_A,species_B,species_C)


  site year species_A species_B species_C
1     A 2000         1         1         1
2     A 2001         2         2         2
3     A 2002         3        NA         3
4     A 2003         4         4         4
5     A 2004         5         5         5
6     B 2000        NA        NA         2
7     B 2001        NA         3         3
8     B 2002        NA         4         4
9     B 2003        NA         5         5
10    B 2004        NA         6         6
 


## Remove columns with any NAs

dat %>% 
  group_by(site) %>%
  select(where( ~!any(is.na(.x))))


## which returns 

   site  year  species_C
   <chr> <chr>     <dbl>
 1 A     2000          1
 2 A     2001          2
 3 A     2002          3
 4 A     2003          4
 5 A     2004          5
 6 B     2000          2
 7 B     2001          3
 8 B     2002          4
 9 B     2003          5
10 B     2004          6



## Alternatively, if i try using "all" in select it will only identify fully incomplete cases.

dat %>% 
  group_by(site) %>%
  select(where( ~!all(is.na(.x))))



## however I am trying to get...
 
   site year species_B species_C
1     A 2000         1         1
2     A 2001         2         2
3     A 2002        NA         3
4     A 2003         4         4
5     A 2004         5         5
6     B 2000        NA         2
7     B 2001         3         3
8     B 2002         4         4
9     B 2003         5         5
10    B 2004         6         6

It seems like this should be fairly straightforward but for whatever reason I cannot seem to get it to work.

Thanks!

CodePudding user response:

Another option:

 dat %>%
  select(site, dat %>%
  group_by(site) %>%
  summarise(across(everything(), ~!all(is.na(.x))))%>%
  ungroup() %>%
  select(-site) %>%
  select(where(all))%>%
  names())

   site year species_B species_C
1     A 2000         1         1
2     A 2001         2         2
3     A 2002        NA         3
4     A 2003         4         4
5     A 2004         5         5
6     B 2000        NA         2
7     B 2001         3         3
8     B 2002         4         4
9     B 2003         5         5
10    B 2004         6         6

CodePudding user response:

You could convert to a long format, remove the variable, then change back to a wide format.

library(tidyverse)

dat %>%
  tidyr::pivot_longer(!c(site, year), names_to = "species", values_to = "values") %>%
  dplyr::group_by(site, species) %>%
  dplyr::mutate(allNA = all(is.na(values))) %>%
  dplyr::ungroup(site) %>%
  dplyr::filter(!any(allNA == TRUE)) %>%
  dplyr::select(-allNA) %>%
  tidyr::pivot_wider(names_from = "species", values_from = "values")

Output

# A tibble: 10 × 4
   site  year  species_B species_C
   <chr> <chr>     <dbl>     <dbl>
 1 A     2000          1         1
 2 A     2001          2         2
 3 A     2002         NA         3
 4 A     2003          4         4
 5 A     2004          5         5
 6 B     2000         NA         2
 7 B     2001          3         3
 8 B     2002          4         4
 9 B     2003          5         5
10 B     2004          6         6

CodePudding user response:

We can split by site, then use select(where(!all(is.na(.x))) to drop the all-NA columns for every dataframe, and finally subset dat by the intersection of column names.

library(dplyr)
library(map)

dat %>% split(site) %>%
    map(\(x) select(x, where(~!all(is.na(.x)))))%>%
    map(names)%>%
    reduce(intersect)%>%
    dat[.]

Or, for a purrr-only solution:

library(purrr)

dat %>% split(site) %>%
    map(~discard(., ~all(is.na(.x))))%>%
    map(names)%>%
    reduce(intersect)%>%
    dat[.]

output

   site year species_B species_C
1     A 2000         1         1
2     A 2001         2         2
3     A 2002        NA         3
4     A 2003         4         4
5     A 2004         5         5
6     B 2000        NA         2
7     B 2001         3         3
8     B 2002         4         4
9     B 2003         5         5
10    B 2004         6         6
  • Related