filter using dplyr to filter where columns and rows match values-CodePudding

I have an example stripped down dataset combined using this:

library(tidyverse)

composite <- inner_join(shared, taxonomy, by="otu") %>%
group_by(group, taxonomy) %>%
filter(count > 3)

The data looks like this:

group otu    count taxonomy  
<fct> <fct>  <dbl> <chr>  
1 17VSD otu001  4559 Escherichia-Shigella  
2 17VSD otu002   870 Enterobacteriaceae_unclassified
3 17VSD otu020    63 Cupriavidus  
4 17VSD otu072    24 Escherichia-Shigella  
5 17VSD otu080    16 Escherichia-Shigella  
6 17VSD otu205     4 Escherichia-Shigella  
7 YG1   otu001    15 Escherichia-Shigella  
8 YG1   otu002    15 Enterobacteriaceae_unclassified
9 YG1   otu004   504 Corynebacterium  
10 YG1   otu006   500 Cutibacterium

In the group column there are 3 variables.

I'm having a lot of trouble with the syntax to get the next filter sequence. I want each group factor to have only the same factors under the otu column. In the data we can see row 9 - YG1 otu004 504 Corynebacterium. I would like to remove this row because group 17VSD does not have otu004. This will also have to be true for the other group (YG2). of course this will get very complicated with the full data set where there are 20 group factors and millions of otu factors.

I've tried expanding filter(count > 3 & ...) but that doesn't seem to be the right direction. Otherwise just lots of searching through examples. My problem may also be because I'm not using the correct language to help solve the issue.

CodePudding user response：

As Ben had pointed out I can use n_distinct() to solve the issue. Thank you Ben!

to get what I needed for the dummy set:

composite <- inner_join(shared, taxonomy, by="otu") %>% 
  group_by(group, taxonomy) %>% 
  filter(count >3) %>%
  group_by(otu)  %>%
  filter(n_distinct(group) == 3)

I also attempted Pablo Serrati's answer. It works, but its a little but complicated. Thank you Pablo

CodePudding user response：

I don't know if this is the simple way, but... you can reshape your data with pivot_, then use this table to select columns with apply, and make a new reshape.

# Example data 
composite <- data.frame(group = rep(c("a", "b", "c"), c(9, 11, 5)),
                        otu = sample(paste0("otu00", 0:9),
                                     size = 25,  replace =  T),
                        count = sample(1:20, size = 25, replace = T))

# Deleting possible duplicated group otu
composite <- composite[!duplicated(composite[c("group", "otu")]), ]  

# reshape data
composite_reshape <- composite |> 
  arrange(group, otu) |> 
  pivot_wider(id_cols = "group", names_from = "otu", values_from = "count")

# Select complete otu columns
composite_reduce <- composite_reshape[!apply(composite_reshape , 2, anyNA)]

# New reshape
composite_final <- composite_reduce |> 
  pivot_longer(cols = -group )

composite_final