Python regex negative lookahead matching where it shouldn't-CodePudding

Example first:

import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
input_re = re.compile(r'(?!output[0-9]*) mem([0-9a-f]+)')
print(input_re.findall(details))
# Out: ['001', '005', '002', '006']

I am using a negative lookahead to extract the hex part of the mem entries that are not preceded by an output entry. However, as you can see, it fails. The desired output is: ['001', '002'].

What am I missing?

CodePudding user response：

You may use this regex in findall:

\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)

RegEx Details:

\b: Word boundary
(?!output\d+): Negative lookahead to assert that we don't have output and 1+ digits ahead
\w+: Match 1+ word characters
\s+: Match 1+ whitespaces
mem([a-fA-F\d]+): Match mem followed by 1+ of any hex character

Code:

import re
s = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
print(re.findall(r'\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)', s))

Output:

['001', '002']

CodePudding user response：

Maybe an easier approach is to split it up into two regular expressions? First, filter out anything that starts with output and is followed by mem, like so:

output[0-9]* mem([0-9a-f]+)

If you filter this out, it would result in:

input1 mem001 data2 mem002

Once you have filtered them out, just search for mem again:

mem([0-9a-f]+)

That would result in your desired output:

['001', '002']

Maybe not an answer to the original question, but it is a solution to your problem.
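
The two steps above could be sketched roughly as follows. This is only an illustration of the idea, assuming `re.sub` is used for the filtering step; the variable names `filtered` and `result` are made up for the example.

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# Step 1: remove every "outputN memXXX" pair from the string.
filtered = re.sub(r'output[0-9]* mem[0-9a-f]+', '', details)

# Step 2: collect the hex part of the mem entries that are left.
result = re.findall(r'mem([0-9a-f]+)', filtered)
print(result)  # ['001', '002']
```

Note that the removal leaves some extra whitespace behind in `filtered`, which does not matter for the second search.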

CodePudding user response：

First of all, let's understand why your original regex doesn't work:

A regex encapsulates two pieces of information: a description of a location within a text, and a description of what to capture from that location. Your original regex tells the regex matcher: "Find a location within the text where the following characters are not 'output' digits but they are ' mem' alphanumerics". Think about the logic of that expression: if the matcher finds a location in the text where the following characters are ' mem' alphanumerics, then, in particular, the following characters are not 'output' digits. Your lookahead does not add anything to the expression.
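
A quick way to check this is to run the pattern from the question with and without the lookahead and compare the results. The variable names below are only for illustration.

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# Pattern from the question, with the negative lookahead.
with_lookahead = re.findall(r'(?!output[0-9]*) mem([0-9a-f]+)', details)

# Same pattern with the lookahead removed.
without_lookahead = re.findall(r' mem([0-9a-f]+)', details)

print(with_lookahead)                       # ['001', '005', '002', '006']
print(with_lookahead == without_lookahead)  # True
```

The lookahead is evaluated at the position of the space before each mem entry, where the next characters can never be 'output', so it always succeeds.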

What you really need is to tell the matcher: "Find a location in the text where the following characters are ' mem' alphanumerics, and the previous characters are not 'output' digits". So what you really need is a look-behind, not a lookahead.

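@ArtyomVancyan proposed a good regex with a look-behind, and it could be modified to what you need: instead of a single digit after the 'output', you want potentially more digits, so you would put an asterisk (*) after the '\d'. Be aware that Python's built-in re module only accepts fixed-width look-behinds, so that variable-width form needs the third-party regex module or a workaround.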
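
As a rough sketch of that limitation: the standard `re` module refuses to compile a look-behind containing `\d*`, and one possible workaround (an assumption on my part, not the regex from the other answer) is to capture the word before each mem entry and filter it in Python. The names `pairs`, `label` and `value` are illustrative.

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# A variable-width look-behind is rejected by the standard re module.
try:
    re.compile(r'(?<!output\d*) mem([0-9a-f]+)')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False

# Workaround: capture the label in front of each mem entry, then
# drop the pairs whose label is "output" followed by digits.
pairs = re.findall(r'(\w+) mem([0-9a-f]+)', details)
result = [value for label, value in pairs
          if not re.fullmatch(r'output\d*', label)]
print(result)  # ['001', '002']
```

Filtering in Python keeps the pattern simple and avoids depending on a regex engine that supports variable-width look-behinds.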