Extract words/texts after a group of phrases in R

Time:03-09

I'm trying to write a function to extract words that come before or after a group of phrases.

To extract the words that come after a single phrase (for example, item) in a string variable called x, I had luck with the code below:

str_extract(x, pattern="(?<=item).*?(?=,)")

How do I pass a list of phrases to a regex? For example, I want to create a vector of phrases called keywords and extract the words that come after any of them. How do I tell the regex that keywords is a vector of alternatives, not literal text?

keywords <- c("item", 
              "date",
              "size",
              "length")

CodePudding user response:

Build the pattern by collapsing the keywords into an alternation. Either of these will work:

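# Variant 1: lazy match up to (but not including) the next comma
paste0("(?<=", paste(keywords, collapse = "|"), ").*?(?=,)")
# Variant 2: match any run of non-comma characters
paste0("(?<=", paste(keywords, collapse = "|"), ")[^,]*")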

The first pattern will look like (?<=item|date|size|length).*?(?=,). It matches a location that is immediately preceded by item, date, size or length, then consumes zero or more characters other than line break characters, as few as possible, up to the leftmost occurrence of a comma, without consuming the comma itself (because (?=,) is a positive lookahead).

The second regex will look like (?<=item|date|size|length)[^,]*, and it matches in much the same way as the first pattern. Note the differences, though: [^,]* matches zero or more characters other than a comma, so 1) it will match even if there is no comma later in the string, and 2) it will match any characters, including line break characters.
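
To see the difference between the two patterns, here is a minimal sketch. It assumes the stringr package is loaded; the sample string x is made up for illustration and is not from the question. Note that the last field has no trailing comma, so only the second pattern picks it up.

```r
library(stringr)

keywords <- c("item", "date", "size", "length")

# Hypothetical sample input (an assumption, not from the original question)
x <- "item apple, date 2021-03-09, size large, length 10"

# Collapse the keywords into a single alternation: item|date|size|length
alt <- paste(keywords, collapse = "|")

pattern_lazy  <- paste0("(?<=", alt, ").*?(?=,)")  # requires a comma after the match
pattern_class <- paste0("(?<=", alt, ")[^,]*")     # does not require a comma

# str_extract_all() returns a list with one element per input string;
# trimws() drops the space left between the keyword and the value
res_lazy  <- trimws(str_extract_all(x, pattern_lazy)[[1]])
res_class <- trimws(str_extract_all(x, pattern_class)[[1]])

print(res_lazy)   # "apple" "2021-03-09" "large"
print(res_class)  # "apple" "2021-03-09" "large" "10"
```

If a keyword may contain regex metacharacters (a dot or parenthesis, say), it would need to be escaped before being pasted into the pattern.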
