Drop rows of data if two conditions don't exist in a column in R

I have a data frame with 52K rows. I want to drop every gene that doesn't have both Light and Healthy in the group column, so that only genes with both values remain. I'm not sure how to do this quickly. I was thinking that dplyr (part of the tidyverse) might be useful.

data
         gene      id   group           snp ref total ref_condition
11080    ZZZ3 Healthy Healthy chr1:77664558   1     5       Healthy
22772    ZZZ3 Healthy Healthy chr1:77557488   2     5       Healthy
1632    ZZEF1 Healthy Healthy chr17:4086375   4     7       Healthy
13357   ZZEF1 Healthy Healthy chr17:4033235   7     9       Healthy
15312  ZYG11B Healthy Healthy chr1:52769202   1     2       Healthy
145341 ZYG11B   Light   Light chr1:52779185   1     4       Healthy

Wanted output
             gene      id   group           snp ref total ref_condition
    15312  ZYG11B Healthy Healthy chr1:52769202   1     2       Healthy
    145341 ZYG11B   Light   Light chr1:52779185   1     4       Healthy

CodePudding user response：

You could group by gene with group_by() and combine two any() calls in the filter, like this:

library(dplyr)
data %>%
  group_by(gene) %>%
  filter(any(group == "Healthy") & any(group == "Light"))
#> # A tibble: 2 × 7
#> # Groups:   gene [1]
#>   gene   id      group   snp             ref total ref_condition
#>   <chr>  <chr>   <chr>   <chr>         <int> <int> <chr>        
#> 1 ZYG11B Healthy Healthy chr1:52769202     1     2 Healthy      
#> 2 ZYG11B Light   Light   chr1:52779185     1     4 Healthy

Created on 2023-01-23 with reprex v2.0.2
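If more than two group values ever need to be required, the pair of any() calls can be collapsed into a single all(... %in% ...) test. The following is only a sketch of that variation: the data frame is rebuilt by hand from the six rows shown in the question (row names omitted), and the names `required` and `kept` are made up for illustration.

```r
library(dplyr)

# Sample rows copied from the question (row names omitted)
data <- data.frame(
  gene  = c("ZZZ3", "ZZZ3", "ZZEF1", "ZZEF1", "ZYG11B", "ZYG11B"),
  id    = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  group = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  snp   = c("chr1:77664558", "chr1:77557488", "chr17:4086375",
            "chr17:4033235", "chr1:52769202", "chr1:52779185"),
  ref   = c(1L, 2L, 4L, 7L, 1L, 1L),
  total = c(5L, 5L, 7L, 9L, 2L, 4L),
  ref_condition = "Healthy"
)

# Hypothetical helper: the group values every gene must contain
required <- c("Healthy", "Light")

kept <- data %>%
  group_by(gene) %>%
  # TRUE for the whole gene only if every required value occurs in its rows
  filter(all(required %in% group)) %>%
  ungroup()

print(kept)
```

On the sample rows this should leave only the two ZYG11B rows, the same result as the two-any() version.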

CodePudding user response：

Simply:

data %>%
  group_by(gene) %>%
  filter(sum(group == "Light") >= 1 & sum(group == "Healthy") >= 1) %>%
  ungroup()

  gene   id      group   snp             ref total ref_condition
  <fct>  <fct>   <fct>   <fct>         <int> <int> <fct>        
1 ZYG11B Healthy Healthy chr1:52769202     1     2 Healthy      
2 ZYG11B Light   Light   chr1:52779185     1     4 Healthy

Original answer: we can count the number of Light and Healthy rows per gene and keep a gene's rows only if n_light >= 1 & n_healthy >= 1.

library(dplyr)
data %>%
  group_by(gene) %>%
  mutate(n_light = sum(group == "Light"),
         n_healthy = sum(group == "Healthy")) %>%
  filter(n_light >= 1 & n_healthy >= 1) %>%
  ungroup()

  gene   id      group   snp             ref total ref_condition n_light n_healthy
  <fct>  <fct>   <fct>   <fct>         <int> <int> <fct>           <int>     <int>
1 ZYG11B Healthy Healthy chr1:52769202     1     2 Healthy             1         1
2 ZYG11B Light   Light   chr1:52779185     1     4 Healthy             1         1

If needed, remove the auxiliary columns n_light and n_healthy afterwards with %>% select(-n_light, -n_healthy). Both names must be negated; select(-n_light, n_healthy) would keep n_healthy.
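For completeness, here is a sketch of the counting approach with both helper columns dropped at the end. It assumes a trimmed copy of the question's data with only the gene and group columns, and the name `result` is made up for illustration.

```r
library(dplyr)

# Trimmed copy of the question's sample rows (only the columns the filter needs)
data <- data.frame(
  gene  = c("ZZZ3", "ZZZ3", "ZZEF1", "ZZEF1", "ZYG11B", "ZYG11B"),
  group = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light")
)

result <- data %>%
  group_by(gene) %>%
  # Per-gene counts of each group value
  mutate(n_light   = sum(group == "Light"),
         n_healthy = sum(group == "Healthy")) %>%
  filter(n_light >= 1 & n_healthy >= 1) %>%
  ungroup() %>%
  # Negate both helper columns so neither survives
  select(-c(n_light, n_healthy))

print(result)
```

Wrapping the names in -c(...) is equivalent to negating each one separately, and is harder to get wrong when the list of helper columns grows.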