Extract existing and missing left-hand collocates of a word

Time:12-06

I'm struggling to extract both existing and missing left-hand collocates of a word such as "like". A collocate is missing when "like" is the first word in a string:

test_string = c("like like like lucy she likes it and she's always liked it.")

Using str_extract_all and the non-whitespace shorthand \\S, I'm getting close, but not close enough (the "l" of the second collocate is curiously omitted):

library(stringr)
unlist(str_extract_all(test_string, "(^|\\S+)(?=\\s?\\blike\\b)"))
[1] ""     "ike"  "like"

Using this pattern I miss out on the missing collocate:

unlist(str_extract_all(test_string, "('?\\b[a-z']+\\b|^)(?=\\s?\\blike\\b)"))
[1] "like" "like"

The correct result would be this ("" stands for the missing collocate of the string-initial "like"):

[1] ""     "like" "like"

I'm wondering, where's the mistake here? How can the extraction be improved?

CodePudding user response:

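In your first pattern, the ^ alternative matches the empty string at position 0; the search then resumes one character further along, so the next match starts inside the first word and yields "ike". In the second pattern, ^ is only tried after the word alternative, which already succeeds at position 0, so the empty match never appears.

Instead, you could use an alternation | to match the position at the start of the string, and a lookbehind assertion with a finite quantifier for the remaining matches: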

  • ^ Start of string (this is the position)
  • (?=like\b) Positive lookahead, assert like followed by a word boundary directly to the right
  • | Or
  • (?<= Positive lookbehind
    • ^ Start of string
    • (?:like\s{1,2}){0,100} Repeat like followed by 1 or 2 whitespace chars, 0 to 100 times. Both quantifiers are finite because the regex engine behind stringr (ICU) does not accept unbounded quantifiers such as * or + inside a lookbehind
  • ) Close lookbehind
  • like\b Match like and a word boundary
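
To see what each side of the alternation contributes, the two branches can be run on their own. This is only a rough sketch; the variable names start_branch and run_branch are made up for illustration and are not part of the answer's pattern.

```r
library(stringr)

test_string <- "like like like lucy she likes it and she's always liked it."

# Branch 1: a zero-width match at the start of the string,
# which succeeds only if the string begins with the word "like".
start_branch <- "^(?=like\\b)"

# Branch 2: "like" preceded by nothing except a run of "like" words
# reaching back to the start of the string (bounded lookbehind).
run_branch <- "(?<=^(?:like\\s{1,2}){0,100})like\\b"

start_hits <- str_extract_all(test_string, start_branch)[[1]]
run_hits   <- str_extract_all(test_string, run_branch)[[1]]

print(start_hits)
print(run_hits)
```

On its own, the second branch also matches the string-initial "like", because zero repetitions are allowed in the lookbehind. In the combined pattern the ^ branch is listed first, so it wins at position 0 and the first "like" is reported as the empty string instead.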


Example

test_string = c("like like like lucy she likes it and she's always liked it.")
library(stringr)
unlist(str_extract_all(test_string, "^(?=like\\b)|(?<=^(?:like\\s{1,2}){0,100})like\\b"))

Output

[1] ""     "like" "like"
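
Note that the lookbehind above only accepts "like" itself as the left-hand neighbor. If the collocate may be any word, a token-based approach might be easier to follow than a single regex. The following is a minimal sketch in base R, not the method from the answer: left_collocates is a hypothetical helper name, and it assumes that words are separated by whitespace and that the node word carries no attached punctuation.

```r
# Hypothetical helper: return the token to the left of every exact
# occurrence of `node`, or "" when `node` is the first token.
left_collocates <- function(x, node = "like") {
  # Assumption: whitespace tokenization is good enough for the input.
  tokens <- strsplit(x, "\\s+")[[1]]
  hits <- which(tokens == node)
  # Prepending "" shifts every token one slot to the right, so indexing
  # with `hits` returns the preceding token (or "" for the first token).
  c("", tokens)[hits]
}

test_string <- "like like like lucy she likes it and she's always liked it."
result <- left_collocates(test_string)
print(result)
```

Because the comparison is exact, "likes" and "liked" are ignored, just as with the \b anchors in the regex version.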