Home > Blockchain >  How to filter data frame based on criteria spanning multiple columns using dplyr in R
How to filter data frame based on criteria spanning multiple columns using dplyr in R

Time:04-07

I have a data frame containing 92 variables and 1900 observations. Essentially I have species ID as my variable of interest and relative abundance as the other variables.

Columns 69:92 (23 columns) are quality control variables that range from 0-10. I'd like to filter my data so that each species has at least 4 quality control variables that are >5. The best thing I've come up with until now is not what I want but at least filters the data on the basis that I have at least 1 variable with QC>5 and the sum of the row is larger than 100:

df_QC <-  df %>%
  filter_at(vars(contains("QC")), any_vars(. >= 5))%>% 
  rowwise()%>% 
  mutate(total = sum(c_across(69:92)))
  filter(total >99.9) 

Is there a way to solve it the way I'd like?

To reiterate, I want to select species if QC> 5 for at least 4 of the QC variables.

My data is a bit large so let's work on a smaller dataset to replicate the problem. iris data but add some quality control variables:

df <-cbind(iris, data.frame(qc1 = sample(0:10, size=150, replace=TRUE),
                            qc2 = sample(0:10, size=150, replace=TRUE),
                            qc3 = sample(0:10, size=150, replace=TRUE),
                            qc4 = sample(0:10, size=150, replace=TRUE),
                            qc5 = sample(0:10, size=150, replace=TRUE)))

#Then similarly I would do a filtering that is not really what I want:

df_QC <-  df %>%
  filter_at(vars(contains("QC")), any_vars(. >= 5))%>% 
  rowwise()%>% 
  mutate(total = sum(c_across(6:10)))%>%
filter(total >15) 

So with this example data, how could I filter out Species with QC>5 for at least 3 QCs?

Thanks in advance for any help!

CodePudding user response:

Here's a way to do it using rowSums. Basically counting the number of values >= 5 for every row, and checking if this number is higher or equal to 4.

dplyr

library(dplyr)
df %>% 
  filter(rowSums(.[str_c("qc", 1:5)] >= 5) >= 4)

base R

df[rowSums(df[paste0("qc", 1:5)] >= 5) >= 4, ]

output

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species qc1 qc2 qc3 qc4 qc5
1           4.8         3.4          1.6         0.2     setosa   6   9   5  10  10
2           5.8         4.0          1.2         0.2     setosa   9   8   3   7   5
3           5.7         3.8          1.7         0.3     setosa   4  10   9   7   7
4           5.1         3.3          1.7         0.5     setosa   7   6  10   1   7
5           4.8         3.4          1.9         0.2     setosa   1   6   6   7   7
6           5.2         3.5          1.5         0.2     setosa   5   1   6   6   7
7           5.5         3.5          1.3         0.2     setosa  10   3   6   6   6
8           4.9         3.6          1.4         0.1     setosa   8   4  10   8   6
9           4.4         3.0          1.3         0.2     setosa   1   8   5   6  10
10          5.0         3.5          1.3         0.3     setosa   9   9   5   9   1
...
  • Related