Regex to capture a word with at least one number in it

I'm almost done with all my regex stuff, but I've run into another problem. I have this regex:

(?=.*\d)[A-Z0-9]{5,}

It captures everything I need, such as:

AP51711

It works, but sometimes it behaves strangely. As far as I understand regex (I'm a noob :p), my pattern is supposed to capture only things that contain at least one DIGIT!

But on this string:

3M BUFFING MACHINE P64392

The output is:

['BUFFING', 'MACHINE', 'P64392']

I don't understand why 'BUFFING' and 'MACHINE' are captured :O

If someone could help me understand this, thanks!

CodePudding user response：

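If you restrict the lookahead to letters, like this:

 (?=[A-Z]*\d)[A-Z0-9]{5,}

you get the expected result, because the lookahead can no longer skip over a space to find a digit in a later word.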
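A quick way to check this, as a minimal sketch assuming Python's standard re module and the sample string from the question:

```python
import re

# The lookahead only allows letters before the digit, so it stops at
# the first space and cannot see digits in a later word.
pattern = re.compile(r'(?=[A-Z]*\d)[A-Z0-9]{5,}')

text = "3M BUFFING MACHINE P64392"
matches = pattern.findall(text)
print(matches)  # ['P64392']
```

'3M' is skipped as well, since it is shorter than the five characters required by {5,}.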

CodePudding user response：

You do not really need a regex here:

sentence = "3M BUFFING MACHINE P64392"

words_with_digits = [word 
                     for word in sentence.split()
                     if any(char.isdigit() for char in word)]
print(words_with_digits)

This will yield

['3M', 'P64392']
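If the five-character minimum from the question still matters, the same idea can be extended with a length check. This is only a sketch: MIN_LENGTH is a name made up here to mirror the {5,} quantifier, and unlike the regex it does not restrict words to A-Z and 0-9.

```python
sentence = "3M BUFFING MACHINE P64392"

# Assumption: mirrors the {5,} quantifier of the regex in the question.
MIN_LENGTH = 5

codes = [word
         for word in sentence.split()
         if len(word) >= MIN_LENGTH
         and any(char.isdigit() for char in word)]
print(codes)  # ['P64392']
```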

CodePudding user response：

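Try this:

(?<![^ ])(?=[^ ]*\d)[^ ]+

Code:

import re
# Python's re needs fixed-width lookbehinds, hence (?<![^ ]) rather than (?<=^| )
pattern = r'(?<![^ ])(?=[^ ]*\d)[^ ]+'
text = "3M BUFFING MACHINE P64392"
result = re.findall(pattern, text)
print(result)  # ['3M', 'P64392']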

CodePudding user response：

You get a match for BUFFING and MACHINE because the lookahead in (?=.*\d)[A-Z0-9]{5,} only asserts that, from the current position, there is a digit somewhere to the right on the same line. Since .* also matches spaces, the digits in P64392 satisfy the assertion at every earlier position.

If that assertion is true, the pattern then matches 5 or more characters from the ranges A-Z and 0-9.
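To see this in action, here is a small sketch (assuming Python's re module) that prints, for each match of the original pattern, the part of the line the lookahead was free to scan:

```python
import re

text = "3M BUFFING MACHINE P64392"
original = re.compile(r'(?=.*\d)[A-Z0-9]{5,}')

for match in original.finditer(text):
    # Everything from the match start to the end of the line is
    # available to (?=.*\d), including the digits of later words.
    visible = text[match.start():]
    print(match.group(), '<- lookahead could scan:', repr(visible))

matches = original.findall(text)
print(matches)  # ['BUFFING', 'MACHINE', 'P64392']
```

For BUFFING the scanned text is 'BUFFING MACHINE P64392', which contains digits, so the assertion passes even though BUFFING itself has none.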

What you could do instead is start with a word boundary, which prevents partial word matches and stops the lookahead from being tried at every position while scanning for a match.

Then assert 5 characters from the accepted set and, if that assertion is true, match a word that contains at least one digit.

Using \d consistently instead of mixing \d and [0-9]:

\b(?=[A-Z\d]{5})[A-Z]*\d[A-Z\d]*
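In Python this could be tried out as follows. It is a minimal sketch using the standard re module and the two sample strings from the question:

```python
import re

# \b              start at a word boundary
# (?=[A-Z\d]{5})  at least 5 accepted characters follow
# [A-Z]*\d        optional letters, then a mandatory digit
# [A-Z\d]*        the rest of the word
pattern = re.compile(r'\b(?=[A-Z\d]{5})[A-Z]*\d[A-Z\d]*')

samples = ["3M BUFFING MACHINE P64392", "AP51711"]
results = [pattern.findall(sample) for sample in samples]
print(results)  # [['P64392'], ['AP51711']]
```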
