In the strings below (a column in a df) I want to extract those in which TRUE is present at least two times. I guess I could do some strsplit and then detect duplicates, but is there a method to do it directly?
head(df$Filter)
[1] "FALSE_TRUE_FALSE_FALSE" "FALSE_TRUE_FALSE_FALSE" "FALSE_TRUE_TRUE_FALSE" "FALSE_TRUE_FALSE_FALSE" "FALSE_TRUE_FALSE_FALSE"
[6] "FALSE_TRUE_FALSE_FALSE"
Expected output in this example:
FALSE_TRUE_TRUE_FALSE
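For reference, a minimal data frame matching the head() output above can be built like this (a sketch; the column name Filter comes from the question, the rest is assumed for reproducibility):
# Assumed reproducible data matching head(df$Filter) above
df <- data.frame(
  Filter = c("FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_FALSE_FALSE",
             "FALSE_TRUE_TRUE_FALSE",  "FALSE_TRUE_FALSE_FALSE",
             "FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_FALSE_FALSE"),
  stringsAsFactors = FALSE
)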
CodePudding user response:
We can use str_count from stringr:
library(dplyr)
library(stringr)
df %>%
  filter(str_count(Filter, "TRUE") > 1)
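For comparison, the strsplit route the question mentions also works, just more verbosely (a sketch in base R, assuming the data frame above):
# Split each string on "_" and count how many pieces equal "TRUE"
n_true <- vapply(strsplit(df$Filter, "_", fixed = TRUE),
                 function(x) sum(x == "TRUE"), integer(1))
df[n_true >= 2, , drop = FALSE]
#                  Filter
# 3 FALSE_TRUE_TRUE_FALSE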
CodePudding user response:
We can just look for TRUE (something) TRUE:
df[grepl("TRUE.*TRUE", df$Filter), , drop = FALSE]
# Filter
# 3 FALSE_TRUE_TRUE_FALSE
This can use stringr::str_detect just as easily:
stringr::str_detect(df$Filter, "TRUE.*TRUE")
# [1] FALSE FALSE TRUE FALSE FALSE FALSE
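That logical vector drops straight into the same subsetting (or into dplyr::filter); on the example data it again keeps only row 3:
df[stringr::str_detect(df$Filter, "TRUE.*TRUE"), , drop = FALSE]
#                  Filter
# 3 FALSE_TRUE_TRUE_FALSE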
Benchmarking here might be premature (with a small dataset), but counting how many times TRUE occurs is relatively expensive:
bench::mark(
  grepl      = dplyr::filter(df, grepl("TRUE.*TRUE", Filter)),
  str_detect = dplyr::filter(df, stringr::str_detect(Filter, "TRUE.*TRUE")),
  str_count  = dplyr::filter(df, stringr::str_count(Filter, "TRUE") == 2)
)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis> <lis>
# 1 grepl 556.3us 635.3us 1483. 2.11KB 6.27 709 3 478ms <df [1 x 1]> <Rprof~ <ben~ <tib~
# 2 str_detect 585.7us 672us 1266. 2.11KB 6.28 605 3 478ms <df [1 x 1]> <Rprof~ <ben~ <tib~
# 3 str_count 4.46ms 5.16ms 188. 3.66KB 9.04 83 4 442ms <df [1 x 1]> <Rprof~ <ben~ <tib~
(It appears that somewhere on the scale of 50,000 rows is where stringr::str_count's performance reaches parity with grepl. Now I'm curious why that is the case ... :-)
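A quick way to probe that claim (a sketch; the 50,000-row data frame below is assumed for illustration, not the data the original timings were run on):
# Recycle the example strings up to ~50,000 rows and re-run the benchmark
big <- data.frame(
  Filter = rep(c("FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_TRUE_FALSE"),
               length.out = 50000),
  stringsAsFactors = FALSE
)
bench::mark(
  grepl     = dplyr::filter(big, grepl("TRUE.*TRUE", Filter)),
  str_count = dplyr::filter(big, stringr::str_count(Filter, "TRUE") == 2)
)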