Using Regex to NOT match a given pattern

Time:10-22

I'm writing a script part of which is isolating part numbers from emails. I have this Regex which is helping isolate the part numbers:

\b(\S[A-Z0-9/-]{3,30})\b

It works perfectly, except it also gives back phone numbers. Many part numbers look similar to a phone number, so changing that regex is probably not an option. What I want to do is write something like "matches \b(\S[A-Z0-9/-]{3,30})\b with the exception of \d\d\d-\d\d\d-\d\d\d\d", but I'm having trouble finding any regex tokens that would give me that exception or "do not match" behaviour. Lookaheads are unlikely to work because there's nothing consistent I can give it to look for ahead of or behind the phone number. Below is an example email I've been working with on Regex101 to test whether it will work. Thank you in advance for any help or ideas.

this is an email, the part numbers are AB-CDE-FGHIJK and 3577/GFGFGF. my phone number is 585-555-6533 but i don't want that! fix it.

CodePudding user response:

I think this will work but I haven't tested it thoroughly:

(?<![\w/-])(?!\d\d\d-\d\d\d-\d\d\d\d(?![\w/-]))(\S[A-Z0-9/-]{3,30})(?![\w/-])

A lot of the noise in that regex is replacing \b with the more precise (?![\w/-]) (i.e., not followed by a word character, / or -), or the negative lookbehind version. If you just use \b you'll get spurious matches in the middle of a phone number, because - is not a word character. You may well need to adjust the phone number pattern, depending on your precise needs (my phone number, for example, does not fit the North American Numbering Plan format, since I live in Peru).
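To make the \b problem concrete, here is a small Python sketch. The sample sentence is trimmed from the email in the question, and the "with_word_boundaries" variant is my own illustration of what happens if you add the phone exclusion but keep \b; it is not something the asker wrote.

```python
import re

# Sample text trimmed from the email in the question.
text = "my phone number is 585-555-6533 but i don't want that!"

# Illustrative variant: phone exclusion added, but boundaries still use \b.
with_word_boundaries = re.compile(
    r"\b(?!\d\d\d-\d\d\d-\d\d\d\d\b)(\S[A-Z0-9/-]{3,30})\b"
)

# The pattern from this answer: boundaries spelled out as lookarounds.
with_lookarounds = re.compile(
    r"(?<![\w/-])(?!\d\d\d-\d\d\d-\d\d\d\d(?![\w/-]))(\S[A-Z0-9/-]{3,30})(?![\w/-])"
)

# \b also matches between a digit and "-", so a match can start at the
# first hyphen, where the phone check no longer applies.
spurious = with_word_boundaries.findall(text)
clean = with_lookarounds.findall(text)

print(spurious)  # ['-555-6533']
print(clean)     # []
```

With \b the engine is allowed to start a match at the first hyphen, so the tail of the phone number slips through; the lookbehind version refuses to start a match right after a digit or hyphen.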

Other than that precision, the basic idea is to first check that the target substring does not match a phone number followed by a non-phone-number character, and then try to match the serial number format. Doing two checks at each point will slow things down a bit, but probably not too much.

I recommend collecting a large variety of test cases, trying to anticipate all possible issues (including serial numbers which include things which look like telephone numbers) and verifying that the regular expression works as expected on all of them.
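A minimal sketch of such a test harness in Python, using the regex from this answer. The helper name find_part_numbers and the second and third test strings are made up for illustration; only the first case comes from the question.

```python
import re

PART_NUMBER = re.compile(
    r"(?<![\w/-])(?!\d\d\d-\d\d\d-\d\d\d\d(?![\w/-]))(\S[A-Z0-9/-]{3,30})(?![\w/-])"
)


def find_part_numbers(text):
    """Return every part-number candidate found in text (hypothetical helper)."""
    return PART_NUMBER.findall(text)


EMAIL = (
    "this is an email, the part numbers are AB-CDE-FGHIJK and 3577/GFGFGF. "
    "my phone number is 585-555-6533 but i don't want that! fix it."
)

# (input text, expected matches); extend this list with real emails.
CASES = [
    (EMAIL, ["AB-CDE-FGHIJK", "3577/GFGFGF"]),
    # A bare phone number should be rejected.
    ("call 585-555-6533", []),
    # A part number that merely starts like a phone number should be kept.
    ("part 585-555-6533-X in stock", ["585-555-6533-X"]),
]

for text, expected in CASES:
    found = find_part_numbers(text)
    status = "ok" if found == expected else "FAIL"
    print(f"{status}: {found!r}")
```

The pattern is case-sensitive as written, so lowercase part numbers would need either the re.IGNORECASE flag or a wider character class; that is worth a test case of its own.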

CodePudding user response:

If I'm understanding you correctly, you want to match AB-CDE-FGHIJK and 82827/djdjd, excluding only the third one. If so, you don't have to exclude anything; just make sure the pattern only matches the part number formats:

\b((([A-Z]+-)+[A-Z]+)|(\d+/\w+))\b