Translating filter_all(any_vars()) to base R

I have a data frame of numbers. I want to keep only the rows in which at least one column, checked across all columns, matches a given value (here 0.5).

With dplyr, this can be written as:

library(dplyr)

set.seed(1)

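# 50 values fill the 10 x 5 matrix exactly (the surplus of runif(500) would be dropped)
df <- data.frame(matrix(round(runif(50, 0, 1), digits = 1), 10, 5))

# keep rows where any column matches 0.5
dfn <- df |> dplyr::filter_all(dplyr::any_vars(grepl(0.5, .)))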

Does anyone know what the base R version of this code would be? Any help is very much appreciated.

CodePudding user response：

1) Apply grepl to each column with sapply, then keep the rows whose number of matches is positive:

df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ]

2) A variation is to use lapply instead of sapply and do.call/pmax instead of rowSums:

df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ]

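3) A third way can be fashioned out of max.col, which returns, for each row, the column index of the largest value. Indexing s at that position gives TRUE exactly when the row contains a match: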

s <- sapply(df, grepl, pattern = 0.5)
df[s[cbind(1:nrow(s), max.col(s))], ]

4) Reduce with | can be used to combine the per-column results with a logical OR:

df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ]
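As a quick sanity check, the four approaches above can be compared on the question's data. This is only a minimal sketch: the names keep1 to keep4 are introduced here for illustration, and seq_len(nrow(s)) is used as an equivalent of 1:nrow(s).

```r
# Rebuild the example data from the question (50 values fill the 10 x 5 matrix)
set.seed(1)
df <- data.frame(matrix(round(runif(50, 0, 1), digits = 1), 10, 5))

# Logical matrix: TRUE where a cell's text matches the pattern
s <- sapply(df, grepl, pattern = 0.5)

keep1 <- rowSums(s) > 0                                         # (1)
keep2 <- do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0  # (2)
keep3 <- unname(s[cbind(seq_len(nrow(s)), max.col(s))])         # (3)
keep4 <- Reduce(`|`, lapply(df, grepl, pattern = 0.5))          # (4)

# All four row masks should agree, so they select the same rows
stopifnot(
  identical(keep1, keep2),
  identical(keep1, keep3),
  identical(keep1, keep4)
)
```

If stopifnot() returns silently, the four masks are identical and any of them can be used as df[keep, ].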

Benchmark

Below we compare the speeds of the various solutions. P0 is the solution in the question and is by far the slowest. The remaining solutions do not differ significantly from one another (they share the letter a in the cld column), although (2) or (4) above gave the lowest runtimes, depending on which metric is used. p5 is the apply-based solution from the other answer.

library(dplyr)
library(microbenchmark)

microbenchmark(
  P0 = df |> dplyr::filter_all(dplyr::any_vars(grepl(0.5, .))),
  p1 = df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ],
  p2 = df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ],
  p3 = { s <- sapply(df, grepl, pattern = 0.5)
         df[s[cbind(1:nrow(s), max.col(s))], ] },
  p4 = df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ],
  p5 = { has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
         df[has_0.5, ] }
)
Unit: microseconds
 expr      min        lq       mean    median        uq      max neval cld
   P0 157038.9 186836.50 266378.971 287828.85 340121.80 387447.5   100   b
   p1    629.9    673.55   1134.377   1164.15   1368.30   3080.0   100  a 
   p2    523.8    587.00   1052.593   1037.55   1200.60   3649.7   100  a 
   p3    647.8    732.60   1207.657   1232.35   1438.25   2186.1   100  a 
   p4    505.7    596.10   1127.210    984.90   1138.95  19122.4   100  a 
   p5   1039.6   1154.90   1899.574   1922.30   2180.55   8359.6   100  a

CodePudding user response：

One possibility is to apply grepl row by row and keep the rows with any match:

has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
df[has_0.5, ]
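One caveat applies to every grepl-based solution in this thread: grepl coerces 0.5 to the pattern "0.5" and treats it as a regular expression, in which . matches any character. That is harmless for the one-decimal data in the question, but it may matter for other data. The sketch below uses made-up values (x is hypothetical) to show the difference, and then a direct numeric comparison as one possible alternative, assuming an exact match on 0.5 is what is wanted:

```r
# Hypothetical values, chosen only to show the difference
x <- c(0.5, 1025, 0.25)

regex_hit   <- grepl(0.5, x)                # regex: "." matches any character, so "1025" matches too
literal_hit <- grepl(0.5, x, fixed = TRUE)  # literal substring "0.5" only
exact_hit   <- x == 0.5                     # exact numeric comparison

# Applied to the question's data: keep rows where any cell equals 0.5 exactly
set.seed(1)
df <- data.frame(matrix(round(runif(50, 0, 1), digits = 1), 10, 5))
dfn <- df[rowSums(df == 0.5) > 0, ]
```

On the question's data the numeric comparison selects the same rows as the grepl versions, because every value is rounded to one decimal place.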