I am hoping to tidy a dataframe by removing variables that are empty for any level of a grouping factor. It is fairly easy to remove columns that are entirely empty, however there appears to be no simple way to apply this selection over groups.
## Data
site<-c("A","A","A","A","A","B","B","B","B","B")
year<-c("2000","2001","2002","2003","2004","2000","2001","2002","2003","2004")
species_A<-c(1,2,3,4,5,NA,NA,NA,NA,NA)
species_B<-c(1,2,NA,4,5,NA,3,4,5,6)
species_C<-c(1,2,3,4,5,2,3,4,5,6)
dat<-data.frame(site,year,species_A,species_B,species_C)
site year species_A species_B species_C
1 A 2000 1 1 1
2 A 2001 2 2 2
3 A 2002 3 NA 3
4 A 2003 4 4 4
5 A 2004 5 5 5
6 B 2000 NA NA 2
7 B 2001 NA 3 3
8 B 2002 NA 4 4
9 B 2003 NA 5 5
10 B 2004 NA 6 6
## Remove columns with any NAs
dat %>%
group_by(site) %>%
select(where( ~!any(is.na(.x))))
## which returns
site year species_C
<chr> <chr> <dbl>
1 A 2000 1
2 A 2001 2
3 A 2002 3
4 A 2003 4
5 A 2004 5
6 B 2000 2
7 B 2001 3
8 B 2002 4
9 B 2003 5
10 B 2004 6
## Alternatively, if i try using "all" in select it will only identify fully incomplete cases.
dat %>%
group_by(site) %>%
select(where( ~!all(is.na(.x))))
## however I am trying to get...
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
It seems like this should be fairly straightforward but for whatever reason I cannot seem to get it to work.
Thanks!
CodePudding user response:
Another option:
dat %>%
select(site, dat %>%
group_by(site) %>%
summarise(across(everything(), ~!all(is.na(.x))))%>%
ungroup() %>%
select(-site) %>%
select(where(all))%>%
names())
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
CodePudding user response:
You could convert to a long format, remove the variable, then change back to a wide format.
library(tidyverse)
dat %>%
tidyr::pivot_longer(!c(site, year), names_to = "species", values_to = "values") %>%
dplyr::group_by(site, species) %>%
dplyr::mutate(allNA = all(is.na(values))) %>%
dplyr::ungroup(site) %>%
dplyr::filter(!any(allNA == TRUE)) %>%
dplyr::select(-allNA) %>%
tidyr::pivot_wider(names_from = "species", values_from = "values")
Output
# A tibble: 10 × 4
site year species_B species_C
<chr> <chr> <dbl> <dbl>
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
CodePudding user response:
We can split
by site, then use select(where(!all(is.na(.x)))
to drop the all-NA columns for every dataframe, and finally subset dat
by the intersection of column names.
library(dplyr)
library(map)
dat %>% split(site) %>%
map(\(x) select(x, where(~!all(is.na(.x)))))%>%
map(names)%>%
reduce(intersect)%>%
dat[.]
Or, for a purrr
-only solution:
library(purrr)
dat %>% split(site) %>%
map(~discard(., ~all(is.na(.x))))%>%
map(names)%>%
reduce(intersect)%>%
dat[.]
output
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6