Hello everyone, I hope you are all doing well.
I have many long strings of text in a dataset, and I am trying to capture all the text between, after, and before a set of words. I will refer to these words as keywords.
keywords= UHJ, uhj, AXY, axy, YUI, yui, OPL, opl, UJI, uji
Suppose I have the following string:
UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI
The following regex will easily match my keywords:
UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji
But since I am interested in capturing everything between, after, and before those words, what I really want is the inverse of my regex, so that I get the pieces of text around the keywords rather than the keywords themselves.
I have tried the following:
[^UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji]
with no luck. The keywords may change in the future, so I would appreciate a regex that works in R and achieves my desired output.
CodePudding user response:
The simplest solution is probably just to split by your pattern. (Note this includes an empty string if the text starts with a keyword.)
x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"
strsplit(x, "UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji")
# [[1]]
# [1] "" " This is only a test to "
# [3] " check regex in a " " educational context so "
# [5] " please be kind "
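If the leading empty string and the surrounding spaces are unwanted, they can be cleaned up after the split. This is only a minimal sketch using base R's trimws() and nzchar(); the variable name pieces is mine, and I am assuming you want whitespace trimmed, which the question does not say explicitly.

```r
x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"

# Split on the keywords, then take the single list element.
pieces <- strsplit(x, "UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji")[[1]]

# Assumption: surrounding whitespace is not wanted.
pieces <- trimws(pieces)

# Drop the empty string produced when the text starts with a keyword.
pieces <- pieces[nzchar(pieces)]
pieces
# [1] "This is only a test to" "check regex in a"
# [3] "educational context so" "please be kind"
```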
Another option is regmatches() with invert = TRUE. (This includes empty strings if the text starts or ends with a keyword.)
regmatches(
  x,
  gregexpr("UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji", x, perl = TRUE),
  invert = TRUE
)
# [[1]]
# [1] "" " This is only a test to "
# [3] " check regex in a " " educational context so "
# [5] " please be kind " ""
Or stringr::str_extract_all() with your pattern in both a lookbehind and a lookahead. (This doesn't include empty strings.)
library(stringr)
str_extract_all(
  x,
  "(?<=UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji).+?(?=UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji)"
)
# [[1]]
# [1] " This is only a test to " " check regex in a "
# [3] " educational context so " " please be kind "
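Since you mention the keywords may change, it may be easier to build the pattern from a character vector instead of typing it out. The sketch below is just one way to do that: build_pattern() is a hypothetical helper name of mine, and it assumes the keywords contain no regex metacharacters (anything like "." or "+" would need escaping first).

```r
x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"

# Hypothetical helper: join each keyword and its lowercase form with "|".
# Assumes the keywords are plain letters with no regex metacharacters.
build_pattern <- function(keywords) {
  paste(unique(c(keywords, tolower(keywords))), collapse = "|")
}

keywords <- c("UHJ", "AXY", "YUI", "OPL", "UJI")
pattern <- build_pattern(keywords)
pattern
# [1] "UHJ|AXY|YUI|OPL|UJI|uhj|axy|yui|opl|uji"

# The same pattern can be passed to strsplit() or gregexpr() ...
result <- strsplit(x, pattern)[[1]]

# ... or wrapped in a lookbehind and lookahead for str_extract_all().
lookaround <- sprintf("(?<=%s).+?(?=%s)", pattern, pattern)
```

Changing the keyword list then only means editing the keywords vector.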