How to extract strings before, between and after different patterns with R?

Time:09-27

I'm looking for a way to extract, from a data frame column, several words that appear before, between and after several patterns.

Let's focus on the first two rows for instance:

[1] "keyword1 / keyword2 [keyword3 - keyword4 - keyword5 - keyword6][keyword7]"
[2] "keyword1[keyword3]"

How could I extract all those keywords and store them in a data frame or a list?

So far I have tried this (which is far from enough):

library(stringr)

a = "keyword1 / keyword2 [keyword3 - keyword4 - keyword5 - keyword6][keyword7]"

str_extract(a, "[^/]+")

str_extract_all(a,"(?<=/).+(?= \\[)")

str_extract_all(a,"(?<=\\[).+(?= \\/)")

str_extract_all(a,"(?<=\\]\\[).+(?=\\])")

CodePudding user response:

Since you plan to extract all words from a string, you can simply use:

str_extract_all(a, '\\w+')
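Applied to the two example rows from the question, this could look like the sketch below. It assumes the rows sit in a character vector named `x` (a hypothetical stand-in for the real data frame column) and shows one possible way to get both a list and a long-format data frame; the names `keywords` and `keywords_df` are illustrative only.

```r
library(stringr)

# The two example rows from the question; `x` is a hypothetical stand-in
# for the actual data frame column.
x <- c(
  "keyword1 / keyword2 [keyword3 - keyword4 - keyword5 - keyword6][keyword7]",
  "keyword1[keyword3]"
)

# str_extract_all() is vectorised over its input: the result is a list
# holding one character vector of matches per input row.
keywords <- str_extract_all(x, "\\w+")

# One possible data frame layout: one row per keyword, tagged with the
# index of the input row it came from.
keywords_df <- data.frame(
  row = rep(seq_along(keywords), lengths(keywords)),
  keyword = unlist(keywords),
  stringsAsFactors = FALSE
)

print(keywords)
print(keywords_df)
```

The list keeps the keywords grouped per row, while the long data frame is usually easier to filter or join afterwards.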

If you only want to match words made of letters or digits, you need to remove connector punctuation, diacritics and a few other characters from the \w pattern (note that in stringr, which uses the ICU regex engine behind the scenes, \w = [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]):

str_extract_all(a, '[\\p{Alphabetic}\\p{Decimal_Number}]+')

In the latter case, you may want to require word boundaries on both ends of the "word" with \b:

str_extract_all(a, '\\b[\\p{Alphabetic}\\p{Decimal_Number}]+\\b')

Or, with a shorter pattern (\p{L} covers letters only, a slightly narrower set than \p{Alphabetic}):

str_extract_all(a, '\\b[\\p{L}\\d]+\\b')
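To see how these patterns differ, here is a small sketch on a made-up input containing an underscore (a connector punctuation character). The string `s` and the result names are hypothetical and not part of the question's data.

```r
library(stringr)

# Made-up input: "foo_bar" contains an underscore, which \w matches
# but the letter/digit character class does not.
s <- "foo_bar baz"

# \w+ treats the underscore as part of the word.
w_matches <- str_extract_all(s, "\\w+")[[1]]

# The letter/digit class stops at the underscore, so "foo_bar" is split.
alnum_matches <- str_extract_all(s, "[\\p{Alphabetic}\\p{Decimal_Number}]+")[[1]]

# With \b on both ends, "foo" and "bar" are rejected: the underscore is itself
# a word character, so there is no word boundary next to it.
bounded_matches <- str_extract_all(s, "\\b[\\p{L}\\d]+\\b")[[1]]

print(w_matches)        # "foo_bar" "baz"
print(alnum_matches)    # "foo" "bar" "baz"
print(bounded_matches)  # "baz"
```

Which behaviour is preferable depends on whether fragments next to an underscore should count as separate words or be dropped altogether.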

You may further experiment with the pattern to make it fit your idea of what a "word" is.
