I have a data frame with about 52K rows. I want to keep only the genes that have both Light and Healthy in the group column and drop all the others. I am not sure how to do this quickly. I was thinking that the tidyverse, dplyr in particular, might be useful.
data
gene id group snp ref total ref_condition
11080 ZZZ3 Healthy Healthy chr1:77664558 1 5 Healthy
22772 ZZZ3 Healthy Healthy chr1:77557488 2 5 Healthy
1632 ZZEF1 Healthy Healthy chr17:4086375 4 7 Healthy
13357 ZZEF1 Healthy Healthy chr17:4033235 7 9 Healthy
15312 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy
145341 ZYG11B Light Light chr1:52779185 1 4 Healthy
Wanted output
gene id group snp ref total ref_condition
15312 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy
145341 ZYG11B Light Light chr1:52779185 1 4 Healthy
CodePudding user response:
You could use two any() calls after group_by(), like this:
library(dplyr)
data %>%
group_by(gene) %>%
filter(any(group == "Healthy") & any(group == "Light"))
#> # A tibble: 2 × 7
#> # Groups: gene [1]
#> gene id group snp ref total ref_condition
#> <chr> <chr> <chr> <chr> <int> <int> <chr>
#> 1 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy
#> 2 ZYG11B Light Light chr1:52779185 1 4 Healthy
Created on 2023-01-23 with reprex v2.0.2
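As a variation on the same idea, the two any() calls can be folded into a single %in% check, which may be easier to extend if more conditions are required later. This is only a sketch: the data frame below is rebuilt by hand from the sample rows in the question (row names omitted), and the names required and result are made up for illustration.

```r
library(dplyr)

# Sample rows copied from the question (row names omitted)
data <- data.frame(
  gene = c("ZZZ3", "ZZZ3", "ZZEF1", "ZZEF1", "ZYG11B", "ZYG11B"),
  id = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  group = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  snp = c("chr1:77664558", "chr1:77557488", "chr17:4086375",
          "chr17:4033235", "chr1:52769202", "chr1:52779185"),
  ref = c(1L, 2L, 4L, 7L, 1L, 1L),
  total = c(5L, 5L, 7L, 9L, 2L, 4L),
  ref_condition = "Healthy",
  stringsAsFactors = FALSE
)

# Hypothetical helper vector: the conditions every kept gene must contain
required <- c("Healthy", "Light")

result <- data %>%
  group_by(gene) %>%
  # TRUE for a gene only if every required condition occurs in its rows
  filter(all(required %in% group)) %>%
  # drop the grouping so later steps work on a plain table
  ungroup()

print(result)
```

With the sample rows, only the two ZYG11B rows should remain.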
CodePudding user response:
Simply:
library(dplyr)
data %>%
  group_by(gene) %>%
  # keep a gene only if it has at least one Light and one Healthy row
  filter(sum(group == "Light") >= 1 & sum(group == "Healthy") >= 1) %>%
  ungroup()
gene id group snp ref total ref_condition
<fct> <fct> <fct> <fct> <int> <int> <fct>
1 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy
2 ZYG11B Light Light chr1:52779185 1 4 Healthy
Original answer:
We can count the number of Light and Healthy rows per gene and keep only the genes where n_light >= 1 & n_healthy >= 1:
library(dplyr)
data %>%
  group_by(gene) %>%
  mutate(n_light = sum(group == "Light"),
         n_healthy = sum(group == "Healthy")) %>%
  filter(n_light >= 1 & n_healthy >= 1) %>%
  ungroup()
gene id group snp ref total ref_condition n_light n_healthy
<fct> <fct> <fct> <fct> <int> <int> <fct> <int> <int>
1 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy 1 1
2 ZYG11B Light Light chr1:52779185 1 4 Healthy 1 1
If needed, remove the auxiliary columns n_light and n_healthy afterwards with %>% select(-n_light, -n_healthy). Each column name needs its own minus sign.
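Putting the steps of this answer together, a minimal self-contained sketch might look like the following. The data frame is rebuilt by hand from the question's sample rows (row names omitted), result is a made-up name, and the helper columns are dropped with select(-c(n_light, n_healthy)), which negates both names at once.

```r
library(dplyr)

# Sample rows copied from the question (row names omitted)
data <- data.frame(
  gene = c("ZZZ3", "ZZZ3", "ZZEF1", "ZZEF1", "ZYG11B", "ZYG11B"),
  id = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  group = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  snp = c("chr1:77664558", "chr1:77557488", "chr17:4086375",
          "chr17:4033235", "chr1:52769202", "chr1:52779185"),
  ref = c(1L, 2L, 4L, 7L, 1L, 1L),
  total = c(5L, 5L, 7L, 9L, 2L, 4L),
  ref_condition = "Healthy",
  stringsAsFactors = FALSE
)

result <- data %>%
  group_by(gene) %>%
  # per-gene counts of rows in each condition
  mutate(n_light = sum(group == "Light"),
         n_healthy = sum(group == "Healthy")) %>%
  # keep genes that have at least one row of each condition
  filter(n_light >= 1 & n_healthy >= 1) %>%
  ungroup() %>%
  # drop both helper columns
  select(-c(n_light, n_healthy))

print(result)
```

The counting version is mainly useful when the counts themselves are of interest; if they are not, filtering directly inside filter() avoids creating and removing the extra columns.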