Apply a lookahead in a regex so that it matches only when followed by the specified pattern

Hello, I am new to regex. I need to apply a regex to a string of US zip codes, which we get by concatenating the rows of a pandas column.

For example, with zip being the header of the column:

zip
you have some thing
70456
90876
78905

We get zip you have some thing 70456 90876 78905 as a single literal string. It should be matched by a regex that expects some characters followed by one or more 5-digit codes separated by whitespace. So I wrote a simple regex, '.*zip.*(\d{5}|\s)*', meaning zip followed by any number of 5-digit codes. However, with re.fullmatch it also matches zip 123456, where zip is followed by a 6-digit code.
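
To see why that pattern accepts a 6-digit code, it may help to look at what each part actually consumes. The `.*` after zip is greedy and can swallow the rest of the string on its own, so the `(\d{5}|\s)*` group is free to match zero times. A minimal sketch, using the sample string from the question:

```python
import re

pattern = r'.*zip.*(\d{5}|\s)*'

# The greedy `.*` after "zip" consumes " 123456" by itself,
# so the trailing group repeats zero times and captures nothing.
m = re.fullmatch(pattern, 'zip 123456')
print(m.group(0))  # zip 123456
print(m.group(1))  # None
```

Since the group never has to take part in the match, it places no constraint on how many digits follow zip.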

For that reason I thought of using a lookahead assertion, but I cannot work out how to use it correctly: it does not give the match I expect. I also tried a lookbehind with re.search, but that seems to fail as well. Can someone give a regex that requires the word zip and only 5-digit codes at the end (possibly with a nan)?

Here is the code I have written:

re.match(r'(?=zip)(\d{5}|\s)*', 'zip 123456')

<re.Match object; span=(0, 0), match=''>

re.search(r'(?<=zip)(\d{5}|\s)*', 'zip 123456')

<re.Match object; span=(3, 9), match=' 12345'>

Can someone tell me how to write a regex that gives a match if '.*zip.*' is followed by codes of exactly 5 digits, and None otherwise? I need a regex for any alphanumeric characters that contain zip, followed by a 5-digit numeric code.
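
A note on the two attempts above: lookarounds are zero-width, so `(?=zip)` checks that zip is ahead without consuming it, and the group then fails on the letter z and matches zero times. The lookbehind version finds ' 12345' because nothing in the pattern says the digits must stop there. One possible way to express "exactly five digits" is a negative lookahead placed after the digits. The sketch below assumes that only non-digit text sits between zip and the code; `has_five_digit_zip` is a hypothetical helper name.

```python
import re

# "zip", then any non-digit text, then five digits that are
# NOT followed by another digit (negative lookahead).
ZIP_RE = re.compile(r'zip\D*\d{5}(?!\d)')

def has_five_digit_zip(text):
    """Hypothetical helper: True if zip is followed by a 5-digit code."""
    return ZIP_RE.search(text) is not None

print(has_five_digit_zip('zip 12345'))   # True
print(has_five_digit_zip('zip 123456'))  # False
print(has_five_digit_zip('zip 1234'))    # False
```

Because `\D*` cannot consume digits, the five digits always start at the first digit after zip, so a longer number cannot be matched from its middle.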

CodePudding user response：

I suggest using a word boundary (\b), as follows:

import re
t1 = 'zip 1234' # less than 5, should not match
t2 = 'zip 12345'  # should match
t3 = 'zip 123456'  # more than 5, should not match
pattern = r'zip\s\d{5}\b'
print(re.search(pattern, t1))  # None
print(re.search(pattern, t2))  # <re.Match object; span=(0, 9), match='zip 12345'>
print(re.search(pattern, t3))  # None

\b is a zero-length assertion that is useful for making sure you match a complete word rather than just part of one. See the re docs for details of how \b works.
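
As a small illustration of the "complete word" behaviour, the sketch below pulls standalone 5-digit runs out of a made-up string. Digits count as word characters, so there is no boundary inside a longer number.

```python
import re

sample = '70456 123456 90876 1234'  # made-up test data

# \b on both sides: the five digits must form a whole "word",
# so 123456 (too long) and 1234 (too short) are skipped.
codes = re.findall(r'\b\d{5}\b', sample)
print(codes)  # ['70456', '90876']
```

Note that letters and underscores are word characters too, so a string such as 12345a would also be rejected by this pattern.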

CodePudding user response：

You can use

re.search(r'\bzip\b.*?\d{5}(?:\s+\d{5})*\b', text)

See the regex demo. If you want to also capture the ZIPs, you can use a capturing group:

re.search(r'\bzip\b.*?(\d{5}(?:\s+\d{5})*)\b', text)

See this regex demo.

Details:

\b - a word boundary
zip - the literal string zip
\b - a word boundary
.*? - zero or more characters other than line break characters, as few as possible
\d{5} - five digits
(?:\s+\d{5})* - zero or more sequences of one or more whitespace characters followed by five digits
\b - a word boundary
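
One thing worth checking with this approach: because `.*?` can also consume digits, it may absorb the first digit of a 6-digit number and let `\d{5}` match the remaining five. The sketch below compares the capturing pattern (written with `\s+` between codes, as the details describe) with a variant that adds a `\b` before the first code. The variant is an assumption for illustration, not part of the answer above, and the names `loose` and `strict` are hypothetical.

```python
import re

# Capturing pattern from the answer, with `\s+` between codes.
loose = re.compile(r'\bzip\b.*?(\d{5}(?:\s+\d{5})*)\b')
# Variant (assumption): an extra \b before the first code stops `.*?`
# from swallowing the leading digit of a longer number.
strict = re.compile(r'\bzip\b.*?\b(\d{5}(?:\s+\d{5})*)\b')

text = 'zip you have some thing 70456 90876 78905'
print(strict.search(text).group(1).split())  # ['70456', '90876', '78905']

print(loose.search('zip 123456').group(1))   # 23456
print(strict.search('zip 123456'))           # None
```

Keep in mind that `re.search` only needs part of the string to match, so a string such as zip 12345 123456 would still be matched on its first code; validating the whole string would need `re.fullmatch` with a pattern written for that purpose.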