Less stringent filter with if

Is there a way to use if_all less stringently in the filter function of dplyr?

# make this example reproducible
set.seed(1)

# create a data frame with 10 random numbers between 1 and 20
df <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))

# define column names
names(df) <- c('A.length', 'b.length','C.length','D.length','E.length')


> df
   A.length  b.length  C.length  D.length  E.length
1  6.044665 18.069404  4.913517 10.456286 18.759399
2  8.070354 18.948830  4.354578 14.634752  5.030708
3 11.884214 13.555158 14.053434 19.846216 13.381802
4 18.255948 12.953167  8.297971  8.220668  3.385547
5  4.831957  2.173939 15.626987 15.771459  6.077193

I want to use a filter function to filter rows where all columns or all columns except 1 are above a specific value.

Something like the following but row number 3 should be included since only 1 column doesn't fulfill the criteria.

df %>%
  filter(if_all(ends_with("length"), ~ .x > 13))

# [1] A.length b.length C.length D.length E.length
# <0 rows> (or 0-length row.names)

CodePudding user response：

Maybe if_any and if_all cannot fit your case. A workaround is to use rowSums(across(...)) >= n to determine whether there are n or more values greater than 13 in each row.

df %>%
  filter(rowSums(across(ends_with("length")) > 13) >= 4)

#   A.length b.length C.length D.length E.length
# 1 11.88421 13.55516 14.05343 19.84622  13.3818

The 3rd row are extracted from df:

df
#    A.length  b.length  C.length  D.length  E.length
# 1  6.044665 18.069404  4.913517 10.456286 18.759399
# 2  8.070354 18.948830  4.354578 14.634752  5.030708
# 3 11.884214 13.555158 14.053434 19.846216 13.381802
# 4 18.255948 12.953167  8.297971  8.220668  3.385547
# 5  4.831957  2.173939 15.626987 15.771459  6.077193

CodePudding user response：

This may not be the best of the approach but an alternate

library(tidyverse)

df %>% rowwise %>% 
mutate(across(ends_with('length'), ~ ifelse(.x>13,1,NA), .names = 'log{col}'),
                     sumx=sum(across(starts_with('log')), na.rm = T)
                     ) %>% filter(sumx>=4) %>% select(-starts_with('log'),-sumx)

^{Created on 2023-01-19 with reprex v2.0.2}

# A tibble: 1 × 5
# Rowwise: 
  A.length b.length C.length D.length E.length
     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1     11.9     13.6     14.1     19.8     13.4