Inverting a regex pattern in R to match all left and between a given set of strings

Time:12-22

Hello everyone, I hope you are all having a good one.

I have multiple long strings of text in a dataset, and I am trying to capture all text between, after, and before a set of words. I will refer to these words as keywords.

keywords= UHJ, uhj, AXY, axy, YUI, yui, OPL, opl, UJI, uji

Suppose I have the following string:

UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI

The following regex will easily match my keywords:

UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji


But since I am interested in capturing everything between, after, and before those words, I effectively want the inverse of my regex, so that every stretch of text that is not a keyword gets matched.

I have tried the following:

[^UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji]


That did not work. The keywords may also change in the future, so I am looking for a regex approach that works in R and produces the desired output.

CodePudding user response:

The simplest solution is probably just to split by your pattern. (Note this includes an empty string if the text starts with a keyword.)

x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"

strsplit(x, "UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji")
# [[1]]
# [1] ""                         " This is only a test to "
# [3] " check regex in a "       " educational context so "
# [5] " please be kind "      
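Since the question mentions that the keywords may change, here is a minimal sketch of building the pattern from a character vector instead of typing it by hand. The names `keywords`, `pattern`, and `pieces` are only illustrative, and the sketch assumes the keywords contain no regex metacharacters (otherwise they would need escaping first).

```r
# Keywords kept in a vector so the list can change without editing the regex
keywords <- c("UHJ", "AXY", "YUI", "OPL", "UJI")

# Add the lower-case variants and join everything with "|"
pattern <- paste(c(keywords, tolower(keywords)), collapse = "|")

x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"

# Split on the generated pattern, same as the hand-written version above
pieces <- strsplit(x, pattern)[[1]]
pieces
```

This matches the same all-upper-case and all-lower-case forms as the hand-written pattern. It does not match mixed case such as "Uhj".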

Another option is regmatches() with invert = TRUE. (This includes empty strings if the text starts or ends with a keyword.)

regmatches(
  x,
  gregexpr("UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji", x, perl = TRUE),
  invert = TRUE
)
# [[1]]
# [1] ""                         " This is only a test to "
# [3] " check regex in a "       " educational context so "
# [5] " please be kind "         ""
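If the empty strings and the surrounding spaces are not wanted, one possible cleanup (a sketch, not the only way) is to trim each piece with trimws() and keep only the non-empty ones with nzchar(). The variable names `pattern` and `parts` are just for illustration.

```r
x <- "UHJ This is only a test to AXY check regex in a YUI educational context so OPL please be kind UJI"
pattern <- "UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji"

# Everything that is NOT matched by the keyword pattern
parts <- regmatches(x, gregexpr(pattern, x, perl = TRUE), invert = TRUE)[[1]]

# Strip leading/trailing whitespace, then drop the empty pieces
parts <- trimws(parts)
parts <- parts[nzchar(parts)]
parts
# [1] "This is only a test to" "check regex in a"       "educational context so"
# [4] "please be kind"
```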

Or stringr::str_extract_all() with your pattern in both a lookbehind and a lookahead. (This doesn't include empty strings.)

library(stringr)

str_extract_all(
  x, 
  "(?<=UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji).+?(?=UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji)"
)
# [[1]]
# [1] " This is only a test to " " check regex in a "      
# [3] " educational context so " " please be kind "   
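As a side note on the bracket attempt in the question: inside [ ], the | is a literal character and the letters are treated individually, so [^UHJ|uhj|...] means "any single character that is not one of these letters or a pipe", not "anything that is not one of these words". A small sketch to illustrate (the variable name `bad` is just for this example):

```r
bad <- "[^UHJ|uhj|AXY|axy|YUI|yui|OPL|opl|UJI|uji]"

# "p" occurs in a keyword (opl), so the class rejects it; "e" does not, so it matches
grepl(bad, c("p", "e"))
# [1] FALSE  TRUE

# Deleting everything the class matches leaves only letters that appear in some keyword
gsub(bad, "", "please")
# [1] "pla"
```

This is why the attempt matched scattered single characters rather than whole stretches of text between keywords.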