I have a tibble:
df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))
I want to remove rows that are substrings of other rows, resulting in:
result <- tibble(x = c('abcd', 'abd', 'efg'))
The solution must be quite efficient as there are ~1M rows of text.
CodePudding user response:
str_extract(df$x, "foo") == "foo"
tests, element by element, whether "foo" is a substring of each entry of df$x. Summing these comparisons (with na.rm = TRUE, since str_extract() returns NA when there is no match) counts in how many elements the string occurs. That count is always at least 1, because x is always a substring of itself; if it is higher, x is also a substring of another element, so we drop those rows with filter(!).
library(tidyverse)
df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))
df %>%
  filter(!map_lgl(x, ~ sum(str_extract(df$x, .x) == .x, na.rm = TRUE) > 1))
#> # A tibble: 3 x 1
#> x
#> <chr>
#> 1 abcd
#> 2 abd
#> 3 efg
Created on 2022-02-18 by the reprex package (v2.0.0)
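A caveat worth noting (my addition, not part of the original answer): str_extract() treats .x as a regular expression, so strings containing metacharacters such as . or ( would be interpreted as patterns. If the text can contain such characters, the same logic works with str_detect() and fixed() for literal matching; a minimal sketch:

library(tidyverse)

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

# Keep a row only if it is a substring of exactly one element of df$x: itself.
# fixed() forces literal matching, so regex metacharacters in x are harmless.
df %>%
  filter(!map_lgl(x, ~ sum(str_detect(df$x, fixed(.x))) > 1))
# -> abcd, abd, efg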
CodePudding user response:
On small datasets this approach is slower (and there speed is not a problem anyway), but on larger ones it is faster. How much faster depends on how many unique strings (groups) there are relative to the size of the data.
library(tidyverse)

# Sort longest-first, then repeatedly drop any later (shorter or equal) string
# that is contained in the current one.
df <- arrange(df, desc(nchar(x)))
my_strings <- df$x
i <- 1
while(i < length(my_strings)){
  indices <- which(str_detect(my_strings[[i]], my_strings[(i + 1):length(my_strings)])) + i
  if(length(indices) > 0) my_strings <- my_strings[-indices]
  i <- i + 1
}
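After the loop, my_strings contains only the strings that are not substrings of any longer string. The answer stops there; to get back to a tibble you can filter on it (a small follow-up of my own, not part of the original answer):

result <- filter(df, x %in% my_strings)
result
# on the example data: abcd, abd, efg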
A possible improvement (untested): run the loop only on the unique strings, keeping each string's original row numbers with data.table, and expand back at the end.
library(data.table)

setDT(df)
# One row per unique string, remembering its original row numbers (.I),
# sorted longest-first so the loop only has to look "downwards".
indices_df <- df[, .(indices = list(.I)), by = x][order(-nchar(x))]
my_strings <- indices_df$x
i <- 1
while(i < length(my_strings)){
  indices <- which(str_detect(my_strings[[i]], my_strings[(i + 1):length(my_strings)])) + i
  if(length(indices) > 0) my_strings <- my_strings[-indices]
  i <- i + 1
}
# Expand the surviving unique strings back to all of their original rows
df[indices_df[x %in% my_strings, unlist(indices)]]
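The key to the improvement is the indices_df step: each unique string goes through the loop once, while its original row numbers stay attached as a list column. A tiny illustration of that mapping, assuming data.table is loaded (my own sketch, not part of the answer):

dt <- data.table(x = c('ab', 'efg', 'ab'))

# One row per unique string, original row numbers kept in a list column
idx <- dt[, .(indices = list(.I)), by = x]
# idx$indices is list(c(1L, 3L), 2L)

# Expanding a chosen set of unique strings back to their original rows
dt[idx[x %in% 'ab', unlist(indices)]]
# rows 1 and 3 of dt, i.e. both 'ab' rows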