I'm struggling to extract both existing and missing left-hand collocates of a word such as "like" if "like" is the first word in a string:
test_string = c("like like like lucy she likes it and she's always liked it.")
Using str_extract_all and the negative character class \\S, I'm getting close, but not close enough (the "l" of the second collocate is curiously omitted):
library(stringr)
unlist(str_extract_all(test_string, "(^|\\S+ )(?=\\s?\\blike\\b)"))
[1] "" "ike" "like"
Using this pattern I miss out on the missing collocate:
unlist(str_extract_all(test_string, "('?\\b[a-z']+ \\b|^)(?=\\s?\\blike\\b)"))
[1] "like" "like"
The correct result would be this ("" stands for the missing collocate of the string-initial "like"):
[1] "" "like" "like"
I'm wondering, where's the mistake here? How can the extraction be improved?
CodePudding user response:
You could use an alternation (|): one branch matches the position at the start of the string, and the other matches each following "like" by means of a lookbehind assertion with a finite quantifier:
^    Start of string (this is the position)
(?=like\b)    Positive lookahead, assert like followed by a word boundary directly to the right
|    Or
(?<=    Positive lookbehind
    ^    Start of string
    (?:like\s{1,2}){0,100}    Repeat using a finite quantifier, matching like followed by whitespace chars (also with a finite quantifier)
)    Close lookbehind
like\b    Match like and a word boundary
Example
test_string = c("like like like lucy she likes it and she's always liked it.")
library(stringr)
unlist(str_extract_all(test_string, "^(?=like\\b)|(?<=^(?:like\\s{1,2}){0,100})like\\b"))
Output
[1] "" "like" "like"
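If a regex is not a hard requirement, a token-based approach might be easier to reason about. The following is only a minimal sketch in base R, not part of the pattern above: left_collocates is a hypothetical helper name, and it assumes that words are separated by whitespace and that the node word has no punctuation attached (so "like," or "likes" would not count as "like").

```r
# Hypothetical helper (not from stringr): return the word to the left of
# every exact occurrence of `node`, using "" when `node` is string-initial.
left_collocates <- function(x, node = "like") {
  # Split the string on runs of whitespace and drop empty tokens
  tokens <- strsplit(x, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # Positions of tokens that are exactly the node word
  hits <- which(tokens == node)
  # Shift the tokens one place to the right so that previous[i] is the
  # left neighbour of tokens[i]; the first token gets "" as its neighbour
  previous <- c("", tokens)[seq_along(tokens)]
  previous[hits]
}

test_string <- "like like like lucy she likes it and she's always liked it."
left_collocates(test_string)
# [1] ""     "like" "like"
```

Because the helper works on whole tokens, it should also return collocates for a "like" that appears later in the string, which the lookbehind pattern above does not attempt.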