I have a data frame of numbers, and I want to keep only the rows in which at least one column contains a given value (here 0.5).
With dplyr, this can be written as:
library(dplyr)
set.seed(1)
df <- data.frame(matrix(round(runif(500, 0, 1), digits = 1), 10, 5))
dfn <- df |> dplyr::filter_all(dplyr::any_vars(grepl(0.5, .)))
Does anyone know what the base R version of this code would be? Any help is very much appreciated.
CodePudding user response:
1) Apply grepl to each column with sapply, then keep the rows whose number of matches is positive:
df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ]
2) A variation is to use lapply instead of sapply and do.call/pmax instead of rowSums:
df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ]
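3) A third way can be fashioned out of max.col, which gives the column position of each row's maximum; indexing s at that position yields TRUE exactly when the row contains a match: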
s <- sapply(df, grepl, pattern = 0.5)
df[s[cbind(1:nrow(s), max.col(s))], ]
4) Reduce with | can also be used:
df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ]
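As a quick sanity check, the four approaches above can be compared on the example data. This is only a sketch: the names k1 to k4 are made up for illustration, and runif(50) is used on the assumption that only the first 50 draws end up in the 10 x 5 matrix from the question.

```r
# Rebuild the example data (assumption: 50 draws are enough to fill 10 x 5)
set.seed(1)
df <- data.frame(matrix(round(runif(50, 0, 1), digits = 1), 10, 5))

# One row selector per approach; k1..k4 are illustrative names
k1 <- rowSums(sapply(df, grepl, pattern = 0.5)) > 0
k2 <- do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0
s  <- sapply(df, grepl, pattern = 0.5)
k3 <- s[cbind(seq_len(nrow(s)), max.col(s))]
k4 <- Reduce(`|`, lapply(df, grepl, pattern = 0.5))

# Each selector should flag the same rows
stopifnot(all(k1 == k2), all(k1 == k3), all(k1 == k4))
```

If stopifnot returns silently, all four selectors agree, so df[k1, ] gives the same rows whichever approach is used.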
Benchmark
Below we compare the speeds of the various solutions. P0 is the solution from the question and is by far the slowest; p5 is the apply solution from the other answer. The remaining solutions are not significantly different from one another (they share the letter a in the cld column), although (2) or (4) above gave the lowest runtimes, depending on which metric is used.
library(microbenchmark)
microbenchmark(
  P0 = df |> dplyr::filter_all(dplyr::any_vars(grepl(0.5, .))),
  p1 = df[rowSums(sapply(df, grepl, pattern = 0.5)) > 0, ],
  p2 = df[do.call("pmax", lapply(df, grepl, pattern = 0.5)) > 0, ],
  p3 = { s <- sapply(df, grepl, pattern = 0.5)
         df[s[cbind(1:nrow(s), max.col(s))], ] },
  p4 = df[Reduce(`|`, lapply(df, grepl, pattern = 0.5)), ],
  p5 = { has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
         df[has_0.5, ] }
)
Unit: microseconds
 expr      min        lq       mean    median        uq      max neval cld
   P0 157038.9 186836.50 266378.971 287828.85 340121.80 387447.5   100   b
   p1    629.9    673.55   1134.377   1164.15   1368.30   3080.0   100  a
   p2    523.8    587.00   1052.593   1037.55   1200.60   3649.7   100  a
   p3    647.8    732.60   1207.657   1232.35   1438.25   2186.1   100  a
   p4    505.7    596.10   1127.210    984.90   1138.95  19122.4   100  a
   p5   1039.6   1154.90   1899.574   1922.30   2180.55   8359.6   100  a
CodePudding user response:
One possibility is to run grepl over each row with apply and keep the rows with any match:
has_0.5 <- apply(df, 1, function(x) any(grepl(0.5, x)))
df[has_0.5, ]
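One caveat that may be worth keeping in mind with grepl(0.5, ...): the number is coerced to the string "0.5" and treated as a regular expression, in which . matches any character. With the one-decimal values in the question this makes no difference, but on other data it could. A small sketch with made-up values (x, regex_hit and fixed_hit are illustrative names, not part of the question):

```r
# Made-up values, chosen only to show where regex and literal matching diverge
x <- c(0.5, 10.5, 1025, 0.25)

# Pattern treated as a regex: "0", then any character, then "5"
regex_hit <- grepl(0.5, x)

# Pattern treated as the literal substring "0.5"
fixed_hit <- grepl("0.5", x, fixed = TRUE)

# 1025 matches only as a regex ("025"); 0.25 matches neither
print(data.frame(x, regex_hit, fixed_hit))
```

Note that even the literal match is a substring match, so 10.5 is still flagged; if exact equality is what is wanted, a comparison on the numbers themselves may be a better fit than grepl.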