find a substring pattern followed by one substring but not by another substring

Time:03-26

I have hospital data (OPCS codes, NHS) consisting of procedure codes followed by a code indicating laterality.

Using regex in R, I would like to identify a procedure code in a string that is followed by other procedure codes and then the laterality code.

However, the match must not include procedure codes of interest that are followed by a different laterality code. Example:

string <- ("W100 Z923 W200 A456 W200 B234 A234 Z921")

What I am trying to match: "W100|W200"

What it must be followed by: "Z921" e.g. Should match this W200 B234 A234 Z921

But must not be followed by: "Z922|Z923" e.g. Should not match this W100 Z923 W200 A456 W200 B234 A234 Z921

What I have tried:

# match the procedure followed by Z921:
(W100|W200).{1,}?Z921

# I do not know how to add a negative lookbehind to exclude matches without breaking this. I have tried the following, but it fails:
((W100|W200).{1,}Z921) (?<!Z923|Z922)

Edit: improved the clarity of the question and the reprex.

CodePudding user response:

You can use

library(stringr)
x <- "W100 Z923 W200 A456 W200 B234 A234 Z921"
str_extract_all(x, "\\bW[12]00\\b(?!\\s*Z92[23]\\b).*?Z921")

See the regex demo. Details:

  • \b - a word boundary
  • W[12]00 - W100 or W200
  • \b - a word boundary
  • (?!\s*Z92[23]\b) - a negative lookahead that fails the match if, immediately to the right, there are zero or more whitespace characters followed by Z922 or Z923 as a whole word
  • .*? - any zero or more chars, other than line break chars, as few as possible
  • Z921 - a literal Z921 string.
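To check how the lookahead behaves on the string from the question, here is a minimal sketch in base R. It assumes the required laterality code is Z921 and the excluded ones are Z922 and Z923, as in the question's example; the variable names x, pattern and matches are only illustrative.

```r
# Example string from the question
x <- "W100 Z923 W200 A456 W200 B234 A234 Z921"

# W100 or W200 as a whole word, not immediately followed by Z922/Z923,
# then as few characters as possible up to Z921
pattern <- "\\bW[12]00\\b(?!\\s*Z92[23]\\b).*?Z921"

# perl = TRUE switches to PCRE, which is required for lookaheads in base R
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(matches)
```

W100 is skipped because Z923 comes directly after it. The match then starts at the first W200 that is not directly followed by Z922 or Z923, so it runs from that W200 through to Z921 and includes the second W200. The lookahead only inspects the code immediately after the procedure code, not codes further along the string.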