Regular Expression misses matches in string

Time:07-04

I'm trying to write a regular expression that captures the desired strings between a set of prefixes ("f38  ", "f38 ", "f1 ", "..") and a set of terminators ("\par", "\hich", "{", "}", "\", "..") in a decompiled DOC file, and to append each match to an array that will eventually be written to a new file.

I'm having trouble catching certain strings between "f38 " and "\hich". This usually happens when the string spans multiple lines, but I've found at least one exception in the example snippet of the DOC file I'm using on regex101.com.

Here is the regular expression as I have it now

(?<=f38  |f38 | |f1 |\.\.)\w.+(?=\\par|\\cell |\\hich|{|}|\\|\.\.)

The troublesome matches come out including "\hich", for example "e\hich" and "d\hich". In these cases I want to match only "e" and "d", not the \hich portion. I suspect the problem has something to do with how newlines/line breaks are handled.

Here is a smaller snippet of the input string. I have bolded what is matched, and bolded and capitalized the problematic match; from it I want the "e", not the \hich. Note that the snippet also contains 2 examples of things going right, where "\hich" is not included in the match.

l\hich\af38\dbch\af31505\loch\f38 ..ikely to involve asbestos exposure: removal, encapsulation, alteration, repair, maintenance, insulation, spill/emergency clean-up, transportation, disposal and storage of ACM. The general industry standards cover all other operations where exposure to asb..\hich\af38\dbch\af31505\loch\f38 E\HICH\af38\dbch\af31505\loch\f38 stos is possible

Here is an example with a longer portion of the input string at regex101.com

Any help would be appreciated. Thanks!

CodePudding user response:

The problem is with the part of the pattern that is supposed to match those single-character samples. \w.+ requires at least two characters to match. So when you get "e\hich", the first backslash is matched by the dot, and the match runs until the next backslash (which is one of the "terminators" listed in the positive lookahead portion of the regex).

You might want to use * instead of +, so that the part after \w is allowed to match zero characters.
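To make the difference concrete, here is a minimal Python sketch, not the asker's exact setup. It assumes a lazy quantifier after the dot, and because Python's re module only accepts fixed-width lookbehind, it keeps a single prefix ("f38 ") instead of the full alternation. The names PREFIX, TERMINATORS, at_least_two and zero_or_more are made up for this illustration, and the sample string is a shortened piece of the snippet from the question with a closing brace added as a terminator.

```python
import re

# Terminators copied from the question's lookahead (braces escaped for clarity).
TERMINATORS = r"(?=\\par|\\cell |\\hich|\{|\}|\\|\.\.)"

# Assumption: only one prefix is used, because Python's re module
# rejects lookbehind alternatives of different widths.
PREFIX = r"(?<=f38 )"

# \w.+? needs one word character plus at least one more character.
at_least_two = re.compile(PREFIX + r"\w.+?" + TERMINATORS)

# \w.*? needs one word character and allows nothing after it.
zero_or_more = re.compile(PREFIX + r"\w.*?" + TERMINATORS)

# Shortened sample based on the question's snippet.
sample = r"\loch\f38 e\hich\af38\dbch\af31505\loch\f38 stos is possible}"

print(at_least_two.findall(sample))  # ['e\\hich', 'stos is possible']
print(zero_or_more.findall(sample))  # ['e', 'stos is possible']
```

With +, the dot is forced to consume the backslash right after "e", so the lookahead cannot succeed until the next terminator. With *, the lookahead is tested immediately after "e" and succeeds on "\hich".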
