I'm struggling to strip down comma-separated strings to unique substrings in a clean fashion:
x <- c("Anna & x, Anna & x", #
"Alb, Berta 222, Alb",
"Al Pacino",
"Abb cd xy, Abb cd xy, C123, C123, B")
I seem to be doing fine with this combination of negated character class, negative lookahead and backreference; what bothers me, however, is that many of the extracted substrings carry unwanted leading whitespace:
library(stringr)
str_extract_all(x, "([^,]+)(?!.*\\1)")
[[1]]
[1] " Anna & x"
[[2]]
[1] " Berta 222" " Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] " Abb cd xy" " C123" " B"
How can the pattern be refined so that no unwanted whitespace gets extracted?
Desired result:
#> [[1]]
#> [1] "Anna & x"
#> [[2]]
#> [1] "Alb" "Berta 222"
#> [[3]]
#> [1] "Al Pacino"
#> [[4]]
#> [1] "Abb cd xy" "C123" "B"
EDIT:
Just wanted to share this solution with double negative lookahead, which also works well (and thanks for the many useful solutions proposed!)
str_extract_all(x, "((?!\\s)[^,]+)(?!.*\\1)")
CodePudding user response:
Change your pattern to the one below:
str_extract_all(x, "(\\b[^,]+)(?!.*\\1)")
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
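To see why \\b helps, and where it stops helping, it can be useful to run the pattern on a string that has a space before the comma. The sketch below uses base R (regmatches with gregexpr and perl = TRUE) as a stand-in for str_extract_all so that it runs without extra packages; the helper name extract_matches and the second test string are made up for illustration.

```r
# Hypothetical helper: base-R stand-in for stringr::str_extract_all()
extract_matches <- function(s, pattern) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))
}

word_boundary <- "(\\b[^,]+)(?!.*\\1)"

# \b cannot match directly before the space that follows a comma
# (both neighbours are non-word characters), so the match starts at the
# first word character and the leading space is skipped.
lead <- extract_matches("Alb, Berta 222, Alb", word_boundary)[[1]]

# A space *before* the comma is still captured by [^,]+, so "Anna " and
# "Anna" are treated as different strings and both are returned.
trail <- extract_matches("Anna , Anna", word_boundary)[[1]]

print(lead)
print(trail)
```

So this pattern appears to be enough for input shaped like x, where whitespace only follows the commas; trailing whitespace would still need something like str_trim.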
CodePudding user response:
You can use str_split to get the individual substrings, followed by unique to remove repeated strings. For example:
library(tidyverse)
str_split(x, ", ?") %>% map(unique)
#> [[1]]
#> [1] "Anna & x"
#>
#> [[2]]
#> [1] "Alb" "Berta 222"
#>
#> [[3]]
#> [1] "Al Pacino"
#>
#> [[4]]
#> [1] "Abb cd xy" "C123" "B"
If you want the output as a single vector of unique strings, you could do:
str_split(x, ", ?") %>% unlist %>% unique
#> [1] "Anna & x" "Alb" "Berta 222" "Al Pacino" "Abb cd xy" "C123"
#> [7] "B"
In the code above we used the regex ", ?" to split at a comma or a comma followed by a space so that we don't end up with a leading space. For future reference, if you do need to get rid of leading or trailing whitespace, you can use str_trim. For example, if we had used "," in str_split we could do the following:
str_split(x, ",") %>%
map(str_trim) %>%
map(unique)
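If the tidyverse is not available, the same split, trim, and de-duplicate steps can be sketched in base R. This is only an illustration using strsplit, trimws and unique; the variable name unique_parts is made up here.

```r
# Sample input from the question
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Split each string on literal commas, trim surrounding whitespace from
# every piece, then keep only the first occurrence of each piece.
unique_parts <- lapply(
  strsplit(x, ",", fixed = TRUE),
  function(parts) unique(trimws(parts))
)

print(unique_parts)
```

Because unique keeps the first occurrence, the second element comes out as "Alb" "Berta 222", which is the order shown in the desired result.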
CodePudding user response:
You need to start matching from a char other than a whitespace and a comma, then optionally match any zero or more chars other than a comma up to a char other than whitespace and a comma:
str_extract_all(x, "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)")
See the regex demo and an R demo online. Mind that if your strings contain line breaks, you need to prepend the pattern with (?s): str_extract_all(x, "(?s)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)").
If you need to make it case insensitive (e.g. Abb cd xy and ABB cD Xy are considered duplicates), add the i flag: str_extract_all(x, "(?i)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") (or str_extract_all(x, "(?si)([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)") if the DOTALL behavior is needed).
Details:
([^\s,](?:[^,]*[^\s,])?) - Group 1:
  [^\s,] - a char other than whitespace and a comma
  (?:[^,]*[^\s,])? - an optional sequence of
    [^,]* - zero or more chars other than a comma
    [^\s,] - a char other than whitespace and a comma
(?!.*\1) - a negative lookahead that fails the match if there are zero or more chars, as many as possible, followed by the Group 1 value.
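The breakdown above can be checked with a small sketch. It uses base R (regmatches with gregexpr and perl = TRUE) in place of str_extract_all so that it runs without stringr; the helper name extract_trimmed_unique and the extra "Al, Alb" input are assumptions made for illustration only.

```r
# Sample input from the question
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Group 1 starts and ends on a char that is neither whitespace nor a comma;
# the lookahead rejects a match whose text occurs again later in the string.
pattern <- "([^\\s,](?:[^,]*[^\\s,])?)(?!.*\\1)"

# Hypothetical helper: base-R stand-in for stringr::str_extract_all()
extract_trimmed_unique <- function(s) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))
}

res <- extract_trimmed_unique(x)
print(res)

# The lookahead compares raw text, not whole comma-separated items:
# "Al" also occurs inside the later "Alb", so it is dropped.
partial <- extract_trimmed_unique("Al, Alb")[[1]]
print(partial)
```

Two things follow from how the lookahead works. It keeps the last occurrence of each substring, so the second element is "Berta 222" "Alb" rather than "Alb" "Berta 222". It also drops an item that is merely a substring of a later item, so where that matters a split-and-unique approach may be the safer choice.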
CodePudding user response:
Not exactly what you asked for, but the NLP frameworks can be helpful when the problems get more complex.
library(tidytext)
library(dplyr)
library(tibble)
tibble(text = x) %>%
rowid_to_column("stringid") %>%
unnest_regex(substring, text, pattern = ",", to_lower = FALSE) %>%
distinct(stringid, substring = trimws(substring))
# # A tibble: 7 x 2
# stringid substring
# <int> <chr>
# 1 1 Anna & x
# 2 2 Alb
# 3 2 Berta 222
# 4 3 Al Pacino
# 5 4 Abb cd xy
# 6 4 C123
# 7 4 B
CodePudding user response:
Just add lapply(..., str_trim) to your code:
library(stringr)
lapply(str_extract_all(x, "([^,]+)(?!.*\\1)"), str_trim)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Berta 222" "Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"