I'm almost done with all my regex work, but I've run into another problem. I have this regex:
(?=.*\d)[A-Z0-9]{5,}
It captures everything I need, for example:
AP51711
It works, but sometimes it behaves strangely. As far as I understand regex (I'm a beginner), my pattern is supposed to capture only things that contain at least one digit.
But on this string:
3M BUFFING MACHINE P64392
The output is:
['BUFFING', 'MACHINE', 'P64392']
I don't understand why 'BUFFING' and 'MACHINE' are captured.
If someone could help me understand this, thanks!
CodePudding user response:
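If you use this instead:
(?=[A-Z]*\d)[A-Z0-9]{5,}
you get the expected result. The lookahead can now only reach a digit by passing over uppercase letters, so it can no longer see a digit that belongs to a later word.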
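A minimal Python sketch to check this pattern against the strings from the question (the variable names are only for illustration):

```python
import re

# Lookahead restricted to uppercase letters before the digit, so it cannot
# skip over a space and find a digit in a later word.
pattern = re.compile(r'(?=[A-Z]*\d)[A-Z0-9]{5,}')

text = "3M BUFFING MACHINE P64392"
matches = pattern.findall(text)
print(matches)  # ['P64392']

single = pattern.findall("AP51711")
print(single)  # ['AP51711']
```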
CodePudding user response:
You do not really need a regex here:
sentence = "3M BUFFING MACHINE P64392"
words_with_digits = [
    word
    for word in sentence.split()
    if any(char.isdigit() for char in word)
]
print(words_with_digits)
This yields:
['3M', 'P64392']
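Note that this keeps '3M', which the regex in the question would reject because of its {5,} length requirement. If that minimum length matters, one possible variation is sketched below. The name min_length is introduced here for illustration and is not part of the original answer:

```python
sentence = "3M BUFFING MACHINE P64392"
min_length = 5  # assumed from the {5,} quantifier in the question's regex

long_words_with_digits = [
    word
    for word in sentence.split()
    if len(word) >= min_length and any(char.isdigit() for char in word)
]
print(long_words_with_digits)  # ['P64392']
```

Unlike the regex, this sketch does not restrict words to the characters A-Z and 0-9; it only checks length and the presence of a digit.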
CodePudding user response:
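Try this:
(?<![^ ])(?=[^ ]*\d)[^ ]+
Code:
import re

# (?<![^ ]) means "at the start of the string or right after a space";
# Python's re module only accepts fixed-width lookbehinds.
pattern = r'(?<![^ ])(?=[^ ]*\d)[^ ]+'
text = "3M BUFFING MACHINE P64392"
result = re.findall(pattern, text)
print(result)  # ['3M', 'P64392']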
CodePudding user response:
You get a match for BUFFING and MACHINE because the pattern (?=.*\d)[A-Z0-9]{5,} asserts that, from the current position, there is a digit somewhere to the right on the same line.
If that assertion is true, it matches 5 or more characters from the ranges A-Z and 0-9.
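This behaviour can be reproduced with a short Python sketch. The slice text[3:] is just a convenient way to start at the word BUFFING:

```python
import re

text = "3M BUFFING MACHINE P64392"
pattern = r'(?=.*\d)[A-Z0-9]{5,}'

# The lookahead alone succeeds at the start of "BUFFING", because ".*"
# can run across the spaces and reach the digits in "P64392".
lookahead_at_buffing = re.match(r'(?=.*\d)', text[3:]) is not None
print(lookahead_at_buffing)  # True

print(re.findall(pattern, text))  # ['BUFFING', 'MACHINE', 'P64392']

# With no digit anywhere to the right, the lookahead fails and nothing matches.
no_digit_result = re.findall(pattern, "BUFFING MACHINE")
print(no_digit_result)  # []
```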
What you might also do is start with a word boundary to prevent a partial word match, so that the lookahead is not tried at every position while scanning for a match.
Then assert 5 characters from the accepted set and, if that assertion is true, match at least a single digit.
Without mixing \d and [0-9]:
\b(?=[A-Z\d]{5})[A-Z]*\d[A-Z\d]*
See a regex demo.
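The same check can be run locally in Python. The sketch below compares the word-boundary pattern with a variant that has no boundary; the input "xAB123" is a made-up string used only to show the partial-word case:

```python
import re

bounded = re.compile(r'\b(?=[A-Z\d]{5})[A-Z]*\d[A-Z\d]*')
unbounded = re.compile(r'(?=[A-Z]*\d)[A-Z0-9]{5,}')

text = "3M BUFFING MACHINE P64392"
bounded_result = bounded.findall(text)
print(bounded_result)  # ['P64392']

# Hypothetical input: an uppercase/digit run glued to a lowercase letter.
partial = "xAB123"
unbounded_partial = unbounded.findall(partial)
bounded_partial = bounded.findall(partial)
print(unbounded_partial)  # ['AB123']
print(bounded_partial)    # []
```

Without \b, the match can begin in the middle of a word. With it, a match can only start where a word starts.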