I'm looking for a way to extract, from a dataframe column, the words that appear before, between and after several patterns.
Let's focus on the first two rows for instance:
[1] "keyword1 / keyword2 [keyword3 - keyword4 - keyword5 - keyword6][keyword7]"
[2] "keyword1[keyword3]"
How could I extract all those keywords and store them in a dataframe or a list?
So far I have tried this (which is far from enough...):
library(stringr)
a = "keyword1 / keyword2 [keyword3 - keyword4 - keyword5 - keyword6][keyword7]"
str_extract(a, "[^/]+")
str_extract_all(a, "(?<=/).+(?= \\[)")
str_extract_all(a, "(?<=\\[).+(?= \\/)")
str_extract_all(a, "(?<=\\]\\[).+(?=\\])")
CodePudding user response:
Since you plan to extract all words from a string, you can simply use
str_extract_all(a, '\\w+')
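Since the question asks about a dataframe column, here is a minimal sketch of applying the same pattern to both sample rows at once. The data frame `df` and its column name `text` are assumptions made for illustration; `str_extract_all` is vectorised, so it returns a list with one character vector per row, which can then be flattened into a long data frame if that is more convenient.

```r
library(stringr)

# Hypothetical data frame; the column name `text` is an assumption
df <- data.frame(
  text = c(
    "keyword1 / keyword2 [keyword3 - keyword4 - keyword5 - keyword6][keyword7]",
    "keyword1[keyword3]"
  ),
  stringsAsFactors = FALSE
)

# List with one character vector of keywords per row
words <- str_extract_all(df$text, "\\w+")

# Long format: one line per (source row, keyword) pair
long <- data.frame(
  row = rep(seq_along(words), lengths(words)),
  keyword = unlist(words),
  stringsAsFactors = FALSE
)
```

The list form keeps the keywords grouped by row, while the long data frame is easier to filter or join afterwards.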
If you plan to only match words that contain letters or digits, then you need to subtract connector punctuation, diacritics and a few more chars from the \w pattern (note that, in stringr, with the ICU regex engine behind the scenes, \w = [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]):
str_extract_all(a, '[\\p{Alphabetic}\\p{Decimal_Number}]+')
In the latter case, you may require word boundaries on both ends of the "word" with \b:
str_extract_all(a, '\\b[\\p{Alphabetic}\\p{Decimal_Number}]+\\b')
Or a shorter, roughly equivalent version:
str_extract_all(a, '\\b[\\p{L}\\d]+\\b')
You may further experiment with the pattern to make it fit your idea of what a "word" is.
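To see how these variants differ in practice, here is a small sketch on a made-up string (the input `x` is not from the question; it only exists to show how an underscore, which is connector punctuation, is treated). It assumes stringr's default regex options, where \b is a boundary between \w and non-\w characters.

```r
library(stringr)

# Hypothetical input containing an underscore (connector punctuation)
x <- "foo_bar baz"

# \w includes the underscore, so "foo_bar" stays in one piece
w_default <- str_extract_all(x, "\\w+")[[1]]
# "foo_bar" "baz"

# Letters and digits only: the underscore splits the token
w_alnum <- str_extract_all(x, "[\\p{Alphabetic}\\p{Decimal_Number}]+")[[1]]
# "foo" "bar" "baz"

# With \b on both ends, "foo" and "bar" are rejected because there is
# no word boundary next to the underscore
w_bounded <- str_extract_all(x, "\\b[\\p{Alphabetic}\\p{Decimal_Number}]+\\b")[[1]]
# "baz"
```

Which behaviour is right depends on whether tokens such as "foo_bar" should count as one word, two words, or be skipped.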