How to strip down comma-separated strings to unique substrings

I'm struggling to strip down comma-separated strings to unique substrings in a clean fashion:

x <- c("Anna & x, Anna & x", #
       "Alb, Berta 222, Alb", 
       "Al Pacino", 
       "Abb cd xy, Abb cd xy, C123, C123, B")

I seem to be doing fine with this combination of a negated character class, a negative lookahead, and a backreference; what bothers me, however, is that many of the extracted substrings carry unwanted leading whitespace:

library(stringr)
str_extract_all(x, "([^,]+)(?!.*\\1)")
[[1]]
[1] " Anna & x"

[[2]]
[1] " Berta 222" " Alb"      

[[3]]
[1] "Al Pacino"

[[4]]
[1] " Abb cd xy" " C123"      " B"

How can the pattern be refined so that no unwanted whitespace gets extracted?

Desired result:
#> [[1]]
#> [1] "Anna & x"
#> [[2]]
#> [1] "Alb"       "Berta 222"
#> [[3]]
#> [1] "Al Pacino"
#> [[4]]
#> [1] "Abb cd xy" "C123"      "B"

EDIT:

Just wanted to share this solution, which uses two negative lookaheads and also works well (and thanks for the many useful solutions proposed!):

str_extract_all(x, "((?!\\s)[^,]+)(?!.*\\1)")

CodePudding user response：

Change your pattern to the one below:

str_extract_all(x, "(\\b[^,]+)(?!.*\\1)")
[[1]]
[1] "Anna & x"

[[2]]
[1] "Berta 222" "Alb"      

[[3]]
[1] "Al Pacino"

[[4]]
[1] "Abb cd xy" "C123"      "B"
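
One thing that may be worth checking before relying on \\b: it only matches at a word boundary, so an item that begins with a non-word character (such as # or &) would lose that leading character. Below is a minimal sketch of the effect. It uses base R with perl = TRUE instead of stringr so that it runs without extra packages, and both the input y and the helper extract_unique are made up for illustration:

```r
# Hypothetical input: the repeated item starts with "#", a non-word character.
y <- "#tag, Alb, #tag"

# Illustrative helper (not part of stringr): return all matches in one string.
extract_unique <- function(s, pattern) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
}

# \\b cannot match right before "#", so the match starts at "t" instead.
with_boundary <- extract_unique(y, "(\\b[^,]+)(?!.*\\1)")

# Starting at any char that is neither whitespace nor a comma keeps the "#".
with_class <- extract_unique(y, "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)")

print(with_boundary)
print(with_class)
```

For the sample data in the question this makes no difference, since every item starts with a letter.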

CodePudding user response：

You can use str_split to get the individual substrings, followed by unique to remove repeated strings. For example:

library(tidyverse)

str_split(x, ", ?") %>% map(unique)
#> [[1]]
#> [1] "Anna & x"
#> 
#> [[2]]
#> [1] "Alb"       "Berta 222"
#> 
#> [[3]]
#> [1] "Al Pacino"
#> 
#> [[4]]
#> [1] "Abb cd xy" "C123"      "B"

If you want the output as a single vector of unique strings, you could do:

str_split(x, ", ?") %>% unlist %>% unique
#> [1] "Anna & x"  "Alb"       "Berta 222" "Al Pacino" "Abb cd xy" "C123"     
#> [7] "B"

In the code above we used the regex ", ?" to split at a comma or a comma followed by a space so that we don't end up with a leading space. For future reference, if you do need to get rid of leading or trailing whitespace, you can use str_trim. For example, if we had used "," in str_split we could do the following:

str_split(x, ",") %>% 
  map(str_trim) %>% 
  map(unique)
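
If you would rather not load the tidyverse, the same split, trim, and deduplicate steps can be sketched in base R. This is just one possible equivalent, with x repeated from the question so the snippet runs on its own; unique_parts is an illustrative name:

```r
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Split each string at literal commas, trim surrounding whitespace,
# then keep the first occurrence of every substring.
unique_parts <- lapply(strsplit(x, ",", fixed = TRUE),
                       function(parts) unique(trimws(parts)))

print(unique_parts)
```

As with the str_split version, the first occurrence of each substring is kept, so the order follows the input.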

CodePudding user response：

You need to start matching from a char other than a whitespace and a comma, then optionally match any zero or more chars other than a comma up to a char other than whitespace and a comma:

str_extract_all(x, "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)")

If your strings contain line breaks, prepend the pattern with (?s) so that . also matches line break chars: str_extract_all(x, "(?s)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)").

If you need to make it case insensitive (e.g. Abb cd xy and ABB cD Xy are considered duplicates), add the i flag: str_extract_all(x, "(?i)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") (or str_extract_all(x, "(?si)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") if the DOTALL behavior is needed).

Details:

- ([^\s,](?:[^,]*[^\s,])?) - Group 1:
  - [^\s,] - a char other than whitespace and a comma
  - (?:[^,]*[^\s,])? - an optional sequence of
    - [^,]* - zero or more chars other than a comma
    - [^\s,] - a char other than whitespace and a comma
- (?!.*\1) - a negative lookahead that fails the match if there are zero or more chars, as many as possible, followed with the Group 1 value.
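
One consequence of the lookahead that may matter for other data: (?!.*\1) rejects a match whenever the Group 1 text occurs anywhere later in the string, not only as a whole item. A small sketch of that edge case follows. It uses base R with perl = TRUE so that it runs without extra packages, and a made-up input z in which one item is a prefix of a later one:

```r
# Hypothetical input: "Al" is a distinct item but also a prefix of "Alb".
z <- "Al, Alb"
pattern <- "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)"

# Regex approach: "Al" is rejected because "Al" occurs again inside "Alb".
by_regex <- regmatches(z, gregexpr(pattern, z, perl = TRUE))[[1]]

# Split approach: whole items are compared, so both survive.
by_split <- unique(trimws(strsplit(z, ",", fixed = TRUE)[[1]]))

print(by_regex)
print(by_split)
```

If items in your data can contain one another like this, comparing whole items after splitting is probably the safer route.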

CodePudding user response：

Not exactly what you asked for, but NLP frameworks such as tidytext can be helpful when the problems get more complex.

library(tidytext)
library(dplyr)
library(tibble)

tibble(text = x) %>% 
  rowid_to_column("stringid") %>% 
  unnest_regex(substring, text, pattern = ",", to_lower = FALSE) %>% 
  distinct(stringid, substring = trimws(substring))

# # A tibble: 7 x 2
#   stringid substring
#      <int> <chr>    
# 1        1 Anna & x 
# 2        2 Alb      
# 3        2 Berta 222
# 4        3 Al Pacino
# 5        4 Abb cd xy
# 6        4 C123     
# 7        4 B

CodePudding user response：

Just add lapply(..., str_trim) to your code:

library(stringr)
lapply(str_extract_all(x, "([^,]+)(?!.*\\1)"), str_trim)

[[1]]
[1] "Anna & x"

[[2]]
[1] "Berta 222" "Alb"      

[[3]]
[1] "Al Pacino"

[[4]]
[1] "Abb cd xy" "C123"      "B"
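
A possible refinement, in case the spacing after commas is not consistent: because the lookahead compares the untrimmed text, " A" and "A" count as different substrings, and trimming afterwards can bring a duplicate back. Wrapping the result in unique() covers that case. The sketch below uses base R (trimws and regmatches with perl = TRUE) and a made-up input y:

```r
# Hypothetical input: a space after the first comma but none after the second.
y <- "x, A,A"

# " A" and "A" are both extracted, since neither text occurs again later.
matches <- regmatches(y, gregexpr("([^,]+)(?!.*\\1)", y, perl = TRUE))[[1]]

trimmed <- trimws(matches)   # trimming alone leaves "A" twice
deduped <- unique(trimmed)   # unique() removes the repeat

print(trimmed)
print(deduped)
```

With stringr the equivalent would be lapply(..., function(m) unique(str_trim(m))).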