detect duplicated words within string

Time:10-25

In the character vector below (a column in a data frame, df) I want to extract the strings in which TRUE occurs at least twice. I guess I could use strsplit and then look for duplicates, but is there a way to do it directly?

head(df$Filter)
[1] "FALSE_TRUE_FALSE_FALSE" "FALSE_TRUE_FALSE_FALSE" "FALSE_TRUE_TRUE_FALSE"  "FALSE_TRUE_FALSE_FALSE" "FALSE_TRUE_FALSE_FALSE"
[6] "FALSE_TRUE_FALSE_FALSE"

Expected output for this example:

FALSE_TRUE_TRUE_FALSE

CodePudding user response:

We can use str_count from stringr, which counts how many times a pattern occurs in each string:

library(dplyr)
library(stringr)
df %>%
    filter(str_count(Filter, "TRUE") > 1)
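For anyone who wants to try this without the original data, here is a minimal sketch that rebuilds the six values shown in the question and does the same count in base R. The name n_true is only illustrative, and the regmatches/gregexpr combination is a swapped-in base R stand-in for the count; stringr::str_count(df$Filter, "TRUE") should give the same numbers.

```r
# Rebuild the six values printed by head(df$Filter) in the question
df <- data.frame(Filter = c(
  "FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_TRUE_FALSE",
  "FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_FALSE_FALSE"
), stringsAsFactors = FALSE)

# Count literal occurrences of "TRUE" per string (n_true is an illustrative name)
n_true <- lengths(regmatches(df$Filter, gregexpr("TRUE", df$Filter, fixed = TRUE)))
n_true
# [1] 1 1 2 1 1 1

# Keep the rows with at least two occurrences
df[n_true > 1, , drop = FALSE]
#                  Filter
# 3 FALSE_TRUE_TRUE_FALSE
```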

CodePudding user response:

We can just look for TRUE, followed by anything, followed by a second TRUE.

df[grepl("TRUE.*TRUE", df$Filter),,drop=FALSE]
#                  Filter
# 3 FALSE_TRUE_TRUE_FALSE

This can use stringr::str_detect just as easily:

stringr::str_detect(df$Filter, "TRUE.*TRUE")
# [1] FALSE FALSE  TRUE FALSE FALSE FALSE
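If the threshold ever needs to be something other than two, the same idea can probably be generalized with a repetition quantifier. The sketch below is only an illustration: the helper has_at_least and the sample vector x are made up for this example and are not part of the original answer.

```r
# Hypothetical sample data, including one string with three TRUEs
x <- c("FALSE_TRUE_FALSE_FALSE", "FALSE_TRUE_TRUE_FALSE", "TRUE_TRUE_TRUE_FALSE")

# has_at_least is an illustrative helper: TRUE when `word` occurs at least n times.
# For n = 2 the pattern "(TRUE.*){2}" matches the same strings as "TRUE.*TRUE".
has_at_least <- function(strings, word, n) {
  grepl(sprintf("(%s.*){%d}", word, n), strings)
}

has_at_least(x, "TRUE", 2)
# [1] FALSE  TRUE  TRUE
has_at_least(x, "TRUE", 3)
# [1] FALSE FALSE  TRUE
```

Note that word is pasted into the regular expression as-is, so a word containing regex metacharacters would need escaping first.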

Benchmarking here might be premature (with a small dataset), but counting how many times TRUE occurs is relatively expensive:

bench::mark(
  grepl = dplyr::filter(df, grepl("TRUE.*TRUE", Filter)),
  str_detect = dplyr::filter(df, stringr::str_detect(Filter, "TRUE.*TRUE")),
  str_count = dplyr::filter(df, stringr::str_count(Filter, "TRUE") > 1)
)
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory  time  gc   
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>       <list>  <lis> <lis>
# 1 grepl       556.3us  635.3us     1483.    2.11KB     6.27   709     3      478ms <df [1 x 1]> <Rprof~ <ben~ <tib~
# 2 str_detect  585.7us    672us     1266.    2.11KB     6.28   605     3      478ms <df [1 x 1]> <Rprof~ <ben~ <tib~
# 3 str_count    4.46ms   5.16ms      188.    3.66KB     9.04    83     4      442ms <df [1 x 1]> <Rprof~ <ben~ <tib~

It appears that at around 50,000 rows stringr::str_count's performance reaches parity with grepl. Now I'm curious why that is the case ... :-)
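To repeat the comparison at that scale, one option is to simulate a larger column. This is a rough sketch under assumptions: the original data is not available, so the values are random, and big, flags, by_grepl and by_count are illustrative names. It only checks that the regex test and the count-based test agree; no timings are claimed here.

```r
set.seed(1)
n <- 50000

# Simulate n strings of four underscore-separated flags, like the Filter column
flags <- matrix(sample(c("TRUE", "FALSE"), n * 4, replace = TRUE), ncol = 4)
big <- data.frame(
  Filter = paste(flags[, 1], flags[, 2], flags[, 3], flags[, 4], sep = "_"),
  stringsAsFactors = FALSE
)

# Regex test versus an explicit count (base R stand-in for str_count)
by_grepl <- grepl("TRUE.*TRUE", big$Filter)
by_count <- lengths(regmatches(big$Filter, gregexpr("TRUE", big$Filter, fixed = TRUE))) > 1

# Both approaches should select exactly the same rows
identical(by_grepl, by_count)
```

Passing big instead of df to the bench::mark call should then show how the three expressions compare on a larger input.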
