Why a pattern works for str_extract_all but does not work for strsplit nor str

Here is my data, a string:

data <- "Mr. NAME. Content1.Mrs. NAMEE. Content2.Ms. NAME ABCD. Content3."

I get a vector of names from my full text (stored in text) with the code below:

library(stringr)
name <- unlist(str_extract_all(text, "Mr\\.\\s[:upper:]{1,20}\\s?[:upper:]{1,20}\\.|Ms\\.\\s[:upper:]{1,20}\\s?[:upper:]{1,20}\\.|Mrs\\.\\s[:upper:]{1,20}\\s?[:upper:]{1,20}\\."))

I get what I want:

name
 [1] "Mr. BOOKER." "Mr. COMER." "Mr. BAIRD." "Mrs. KIRKPATRICK."
 [5] "Ms. CORTEZ MASTO." "Ms. ROSEN." "Mrs. HAYES." "Ms. SHALALA."
 [9] "Mr. DEUTCH." "Mr. KENNEDY." "Mr. KRISHNAMOORTHI." "Mr. SOTO."
[13] "Mr. SOTO." "Mrs. DEMINGS." "Mr. SOTO." "Mr. CICILLINE."
[17] "Mr. SOTO." "Ms. WASSERMAN SCHULTZ." "Mr. SOTO." "Ms. WASSERMAN SCHULTZ."

How can I get a vector of the content between the names? I want a vector like this:

"Content1."   "Content2."    "Content3."

I tried str_subset and strsplit with the same pattern I use in str_extract_all to get the content between the names, but both failed again and again.

CodePudding user response：

You can actually use your regex with stringr::str_split. However, it makes sense to condense the three alternatives into a single pattern:

pattern <- "\\bM(?:rs?|s)\\.\\s\\p{Lu}{1,20}\\s?\\p{Lu}{1,20}\\."

Ms, Mr and Mrs can be joined into the M(?:rs?|s) pattern (M, then either r with an optional s, or just s).
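As a quick sanity check of that alternation, here is a minimal sketch. The test vector titles is made up for illustration, and the ^ and $ anchors are added only so that each string has to match in full:

```r
library(stringr)

# Hypothetical test strings: three valid titles and three near misses
titles <- c("Mr", "Mrs", "Ms", "M", "Mss", "Mrss")

# Anchored so the whole string must be one of Mr, Mrs or Ms
is_title <- str_detect(titles, "^M(?:rs?|s)$")
is_title
# => [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```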

Now, you can use this pattern with stringr::str_split:

pattern <- "\\bM(?:rs?|s)\\.\\s\\p{Lu}{1,20}\\s?\\p{Lu}{1,20}\\."
library(stringr)
str_split(data, pattern)
# => [[1]]
#    [1] ""           " Content1." " Content2." " Content3."

Why is there an empty string at the start? Because the pattern matches at the very start of the string. When splitting, each match is removed and the text before and after it becomes a separate item. When a match sits at the start of the string, the text before it is empty, so the first item is an empty string. The same happens when a match is at the end of the string, or when two matches are adjacent.
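The same behaviour can be seen on a toy input. This is only an illustrative sketch: the string "-a--b-" is made up, and "-" stands in for the name pattern so that there is a match at the start, two adjacent matches in the middle, and a match at the end:

```r
library(stringr)

# Hypothetical toy input; "-" plays the role of the name pattern
parts <- str_split("-a--b-", "-")[[1]]
parts
# => [1] ""  "a" ""  "b" ""
```

Each empty string marks a place where there was nothing between two split points (or between a split point and the edge of the string).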

If you do not want to have empty items in the output, simply remove them:

pattern <- "\\bM(?:rs?|s)\\.\\s\\p{Lu}{1,20}\\s?\\p{Lu}{1,20}\\."
library(stringr)
result <- str_split(data, pattern)
lapply(result, function(x) x[x != ""])
# => [[1]]
#    [1] " Content1." " Content2." " Content3."
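The remaining items still start with a space, while the output asked for in the question has none. A possible follow-up, sketched here on the sample string from the question, is to take the first list element, drop the empty items and trim the whitespace with str_trim (the names pieces and content are just illustrative):

```r
library(stringr)

data <- "Mr. NAME. Content1.Mrs. NAMEE. Content2.Ms. NAME ABCD. Content3."
pattern <- "\\bM(?:rs?|s)\\.\\s\\p{Lu}{1,20}\\s?\\p{Lu}{1,20}\\."

# str_split returns a list with one element per input string
pieces <- str_split(data, pattern)[[1]]

# Drop empty items, then strip leading and trailing whitespace
content <- str_trim(pieces[pieces != ""])
content
# => [1] "Content1." "Content2." "Content3."
```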

CodePudding user response：

A simple way to obtain the result you want is to replace every match of your regex with a token using str_replace_all, then split by that token:

# All three titles in one pattern (the question's alternatives, condensed)
pattern <- "(?:Mr|Mrs|Ms)\\.\\s[:upper:]{1,20}\\s?[:upper:]{1,20}\\."
# Replace every name with a token that does not occur in the text
tokenized <- stringr::str_replace_all(data, pattern, "xyx")
stringr::str_split(tokenized, "xyx")

Yields

""           " Content1." " Content2." " Content3."
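As for why the original pattern may fail in base R's strsplit: the question does not show the exact call, so this is an assumption, but one likely cause is the bare [:upper:]. stringr's regex engine (ICU) accepts it as written, whereas base R's default engine reads it as a bracket list of the literal characters :, u, p, e and r. In base R the class has to be written [[:upper:]]. A rough sketch on the sample string, using perl = TRUE so that \\s and (?:...) are handled by PCRE (pattern_base and parts are illustrative names):

```r
data <- "Mr. NAME. Content1.Mrs. NAMEE. Content2.Ms. NAME ABCD. Content3."

# In base R, a bare [:upper:] is just a set of literal characters
grepl("[:upper:]", "A")    # FALSE: "A" is not one of : u p e r
grepl("[[:upper:]]", "A")  # TRUE: the POSIX class inside a bracket list

# Same idea as the stringr pattern, written for base R
pattern_base <- "M(?:rs?|s)\\.\\s[[:upper:]]{1,20}\\s?[[:upper:]]{1,20}\\."
parts <- strsplit(data, pattern_base, perl = TRUE)[[1]]

# Drop the empty first item and trim the leading spaces
content <- trimws(parts[nzchar(parts)])
content
# => [1] "Content1." "Content2." "Content3."
```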