Regex pattern to match a certain type of substring (having characters in a certain order)

For example, the input string is: Hi a14457h11 what is up?

I just want to extract a14457h11 from the sentence.

Input could also be Hi a14457h11 what is up? I am b14457h12 good. That is a14457h11 great

Here the extracted output would be: a14457h11 and b14457h12 (the third one is repeated).

It will always be in the same format: 1 letter followed by 5 digits, then 1 letter and 2 more digits (in that order).

I can do it in plain Python, but is there a way to do it with a regex as well?

CodePudding user response：

You can use this pattern: [a-z][0-9]{5}[a-z][0-9]{2}

import re
s = "Hi a14457h11 what is up? I am b14457h12 good. That is b14457h12 great"
regex = re.compile(r'[a-z][0-9]{5}[a-z][0-9]{2}')
l = regex.findall(s)
print(l)
# ['a14457h11', 'b14457h12', 'b14457h12']

If your input can also contain uppercase letters, you will have to change the character classes ([a-z]) accordingly, for example to [a-zA-Z].

Explanation:

[a-z]    matches one lowercase letter
[0-9]{5} matches five digits
[a-z]    matches one lowercase letter
[0-9]{2} matches two digits

Test here: https://regex101.com/r/XHzITt/1
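
The note about letter case can be sketched as follows. This is a minimal example, assuming the IDs may mix upper and lower case; the input string is made up for illustration, and passing re.IGNORECASE is one option alongside widening the classes to [a-zA-Z].

```python
import re

# Hypothetical input (not from the question) mixing upper- and lowercase IDs.
s = "Hi A14457h11 what is up? I am b14457H12 good."

# re.IGNORECASE lets [a-z] match uppercase letters as well.
pattern = re.compile(r'[a-z][0-9]{5}[a-z][0-9]{2}', re.IGNORECASE)
matches = pattern.findall(s)
print(matches)
# ['A14457h11', 'b14457H12']
```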

CodePudding user response：

As stated in the question, the input will always be in the same format: 1 letter followed by 5 digits, then 1 letter and 2 more digits (in that order).

Since it always follows the same order, we can build the regex from that rule:


r"(\w\d{5}\w\d{2})"

Explanation

\w matches any word character ([a-zA-Z0-9_] for ASCII text)
\d matches a digit (equivalent to [0-9])
{5} matches the previous token exactly 5 times
\w matches any word character ([a-zA-Z0-9_] for ASCII text)
\d matches a digit (equivalent to [0-9])
{2} matches the previous token exactly 2 times

ref: regex101

Edit1: As pointed out by @anotherGatsby, \w matches [a-zA-Z0-9_], so it also accepts a digit or an underscore where a letter is expected. A character class such as [a-z] (lowercase only) or [a-zA-Z] (either case) is a stricter alternative.
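
To see why the \w version can over-match, here is a small sketch comparing it with a stricter character class. The input string is hypothetical, and the \b word boundaries are an extra assumption on my part (they stop the pattern from matching inside a longer token); they are not part of the original answer.

```python
import re

# Hypothetical input: the second token is all digits and is not a valid ID.
s = "Hi a14457h11 and 123456789 there"

# \w also accepts digits, so a run of nine digits satisfies the pattern.
loose = re.findall(r'\w\d{5}\w\d{2}', s)

# Letters only, with word boundaries (an added assumption) around the ID.
strict = re.findall(r'\b[a-zA-Z]\d{5}[a-zA-Z]\d{2}\b', s)

print(loose)   # ['a14457h11', '123456789']
print(strict)  # ['a14457h11']
```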

CodePudding user response：

import re
s = "Hi a14457h11 what is up? I am b14457h12 good. That is b14457h12 great"
regex = re.compile(r'[a-z]\d{5}[a-z]\d\d')
f = set(regex.findall(s))
print(f)
# {'b14457h12', 'a14457h11'}  (a set is unordered, so the order may vary)
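
If the order of first appearance matters, a set will not preserve it. One possible alternative, sketched here with the second input string from the question, is to deduplicate with dict.fromkeys; this is a suggestion, not part of the answer above.

```python
import re

s = "Hi a14457h11 what is up? I am b14457h12 good. That is a14457h11 great"
regex = re.compile(r'[a-z]\d{5}[a-z]\d\d')

# dict.fromkeys keeps the first occurrence of each key, and dicts preserve
# insertion order (Python 3.7+), so duplicates are dropped in order.
unique = list(dict.fromkeys(regex.findall(s)))
print(unique)
# ['a14457h11', 'b14457h12']
```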