I have hospital data (OPCS, NHS) which consists of procedure codes followed by a code to indicate laterality.
Using regex and R, I would like to identify a procedure code in a string which is followed by other procedure codes and then the laterality code.
However, the match must not include procedure codes of interest which are followed by a different laterality code. Example:
string <- ("W100 Z923 W200 A456 W200 B234 A234 Z921")
What I am trying to match: "W100|W200"
What it must be followed by: "Z921"
e.g. should match this: W200 B234 A234 Z921
But it must not be followed by: "Z922|Z923"
e.g. should not match this: W100 Z923 W200 A456 W200 B234 A234 Z921
What I have tried:
# Match the procedure followed by Z921:
(W100|W200).{1,}?Z921
# I do not know how to add a negative lookbehind to exclude matches without breaking this. I have tried the following, but it fails:
((W100|W200).{1,}Z921) (?<!Z923|Z922)
edit: Improved the clarity of question and reprex
CodePudding user response:
You can use
library(stringr)
# x is the input character vector, e.g. the string from the question
str_extract_all(x, "\\bW[12]00\\b(?!\\s*Z92[23]\\b).*?Z921")
See the regex demo. Details:
\b - a word boundary
W[12]00 - W100 or W200
\b - a word boundary
(?!\s*Z92[23]\b) - a negative lookahead that fails the match if, immediately to the right, there are zero or more whitespace characters followed by Z922 or Z923 as a whole word
.*? - any zero or more chars, other than line break chars, as few as possible
Z921 - a Z921 string.
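As a quick check, the pattern can be run against the example string from the question. The sketch below uses base R (regmatches() and gregexpr() with perl = TRUE) instead of stringr, purely to avoid a package dependency. It takes Z921 as the required laterality code, as in the question's example. The helper name extract_all is hypothetical. Because the lookahead only inspects the code immediately after W100/W200, the match starts at the first W200 that is not directly followed by Z922/Z923 and runs on to Z921. If only the nearest procedure code is wanted (W200 B234 A234 Z921), one possible variant is a tempered pattern that refuses to cross another W100/W200 or a Z922/Z923 on the way to Z921. That variant is an assumption about the intent, not part of the answer above.

```r
# Example string from the question
x <- "W100 Z923 W200 A456 W200 B234 A234 Z921"

# Hypothetical helper: all matches of a PCRE pattern in one string (base R only)
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Lookahead-only pattern: checks just the code right after W100/W200
lookahead_only <- "\\bW[12]00\\b(?!\\s*Z92[23]\\b).*?Z921"
res_lookahead <- extract_all(lookahead_only, x)
print(res_lookahead)
# [1] "W200 A456 W200 B234 A234 Z921"

# Assumed variant: between the procedure code and Z921, do not step over
# another W100/W200 or a Z922/Z923 code (a "tempered" dot)
tempered <- "\\bW[12]00\\b(?:(?!\\b(?:W[12]00|Z92[23])\\b).)*?\\bZ921\\b"
res_tempered <- extract_all(tempered, x)
print(res_tempered)
# [1] "W200 B234 A234 Z921"
```

The tempered variant is slower on long strings because the lookahead runs at every character, so it is probably only worth using when the shorter match is what you need.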