Python regex for sequence containing at least two digits/letters

Using the Python module re, I would like to detect sequences in a text that contain at least two letters (A-Z) and at least two digits (0-9), e.g., from the text

"N03FZ467 other text N03671"

only the substring "N03FZ467" should be matched.

The best I have got so far is

(?=[A-Z]*\d)[A-Z0-9]{4,}

which detects sequences of length at least 4 that contain only letters A-Z and digits 0-9, and at least one digit and one letter. How can I make sure I get at least two of each?
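
A quick way to see where that attempt falls short is to run it over the sample text. The sketch below is only illustrative, and the variable names are made up for the example.

```python
import re

# The pattern from the question: one lookahead for a digit, then 4+ allowed chars.
attempt = re.compile(r"(?=[A-Z]*\d)[A-Z0-9]{4,}")

text = "N03FZ467 other text N03671"

# Both tokens are returned, although "N03671" contains only one letter.
found = attempt.findall(text)
print(found)

# A digits-only token also passes: the lookahead allows zero leading letters.
digits_only = attempt.fullmatch("1234") is not None
print(digits_only)
```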

CodePudding user response：

Use lookaheads, one for each requirement:

^(?=(.*\d){2})(?=(.*[A-Z]){2}).*

Regex breakdown:

(?=(.*\d){2}) is "2 digits somewhere ahead"
(?=(.*[A-Z]){2}) is "2 letters somewhere ahead"

The more efficient version:

^(?=(?:.*?\d){2})(?=(?:.*?[A-Z]){2}).*

It's more efficient because it doesn't capture (uses non-capturing groups (?:...)) and it uses the reluctant quantifier .*? which matches as early as possible in the input, whereas .* will scan ahead to the end then backtrack to find a match.
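
Note that this pattern is anchored with ^ and consumes with .*, so it validates a whole string rather than picking a token out of a longer text. One possible way to use it on the question's input, sketched here in Python, is to collect candidate tokens first and test each one. The token pattern \b[A-Z0-9]+\b and the variable names are assumptions for illustration, not part of the answer.

```python
import re

# The "more efficient" pattern from the answer above.
two_and_two = re.compile(r"^(?=(?:.*?\d){2})(?=(?:.*?[A-Z]){2}).*")

text = "N03FZ467 other text N03671"

# Applied to the full text, the anchored pattern matches the entire line,
# because the line as a whole contains two digits and two uppercase letters.
whole_line = two_and_two.match(text).group(0)

# Assumed preprocessing step: extract runs of A-Z/0-9 as candidate tokens,
# then keep only the tokens that satisfy both lookaheads.
tokens = re.findall(r"\b[A-Z0-9]+\b", text)
matches = [t for t in tokens if two_and_two.match(t)]

print(whole_line)
print(matches)
```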

CodePudding user response：

If you only want to match the chars A-Z and 0-9, you can use a single lookahead (if supported) to make sure 2 digits are present, and then match 2 letters A-Z while matching the string itself. Note that \d\d and [A-Z]{2} each require the two characters to be adjacent.

Because the lookahead asserts 2 digits and the pattern matches 2 letters, the length is automatically at least 4 chars.

\b(?=[A-Z\d]*\d\d)[A-Z\d]*[A-Z]{2}[A-Z\d]*\b

Explanation

\b A word boundary to prevent a partial word match
(?=[A-Z\d]*\d\d) Positive lookahead, assert 2 consecutive digits to the right
[A-Z\d]* Match optional chars A-Z or digits
[A-Z]{2} Match 2 uppercase chars A-Z
[A-Z\d]* Match optional chars A-Z or digits
\b A word boundary
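
In Python, the pattern can be applied with re.findall. Since \d\d and [A-Z]{2} only accept adjacent characters, a token such as "A1B2" is skipped. The second pattern below is a hypothetical variant (not from the answer) that uses two lookaheads to accept the two digits and the two letters anywhere in the token. The sample text with the extra "A1B2" token is made up for illustration.

```python
import re

# Pattern from the answer: needs two adjacent digits and two adjacent letters.
adjacent = re.compile(r"\b(?=[A-Z\d]*\d\d)[A-Z\d]*[A-Z]{2}[A-Z\d]*\b")

# Hypothetical variant: two digits and two letters anywhere in the token.
anywhere = re.compile(
    r"\b(?=[A-Z\d]*\d[A-Z\d]*\d)"        # at least two digits in the token
    r"(?=[A-Z\d]*[A-Z][A-Z\d]*[A-Z])"    # at least two letters in the token
    r"[A-Z\d]+\b"                        # consume the whole token
)

text = "N03FZ467 other text N03671 A1B2"

adjacent_hits = adjacent.findall(text)
anywhere_hits = anywhere.findall(text)

print(adjacent_hits)
print(anywhere_hits)
```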

CodePudding user response：

I would enhance the given answer and do this:

(?=\b(?:\D+\d+){2}\b)(?=\b(?:[^a-z]+[a-z]+){2}\b)\S+

This contains two lookaheads, each validating one rule:

(?=\b(?:\D+\d+){2}\b) - a lookahead asserting that what follows is a word boundary \b, then non-digits followed by digits (\D+\d+), repeated to make sure we have at least two such groups, then a word boundary again, to be sure we are within one "word".

The other lookahead is the same, but instead of non-digits and digits it uses non-letters [^a-z]+ and letters [a-z]+: (?=\b(?:[^a-z]+[a-z]+){2}\b)

At the end, we just match the whole "word" with \S+, which simply matches all non-whitespace characters (since the lookaheads already asserted our "word", this is sufficient).