Hello, I am new to regex. I need to apply a regex to a string of US zip codes, which we get by concatenating the rows of a pandas column.
For example, with zip being the header of the column:
zip
you have some thing
70456
90876
78905
we get the string zip you have some thing 70456 90876 78905
as a single literal string. It should be matched by a regex that accepts some characters followed by one or more 5-digit codes separated by whitespace,
so I wrote a simple regex, '.*zip.*(\d{5}|\s)*',
meaning zip followed by any number of 5-digit codes. But with re.fullmatch it also matches zip 123456, where zip is followed by a 6-digit code.
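A minimal sketch of why the full match succeeds, using only the standard re module and the pattern quoted above (written as a raw string here): the second `.*` is greedy and can consume the whole tail by itself, so the digit group never has to take part.

```python
import re

pattern = r'.*zip.*(\d{5}|\s)*'
m = re.fullmatch(pattern, 'zip 123456')

# The second `.*` swallows ' 123456' on its own, so the group
# `(\d{5}|\s)*` matches zero times and captures nothing.
print(m.group(0))  # zip 123456
print(m.group(1))  # None
```

Because `.*` accepts digits too, the pattern places no real constraint on how many digits follow zip.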
For that reason I thought of using a lookahead assertion, but I do not know how to use it correctly: it does not give a useful match. I also tried a lookbehind with re.search, but that seems to fail too. Can someone give a regex that requires the word zip and only 5-digit codes at the end (there may also be a nan)?
Here is the code I have written:
re.match('(?=zip)(\d{5}|\s)*','zip 123456')
<re.Match object; span=(0, 0), match=''>
re.search('(?<=zip)(\d{5}|\s)*','zip 123456')
<re.Match object; span=(3, 9), match=' 12345'>
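The two results above can be traced with a short sketch. It reuses the patterns from the post as raw strings; the last pattern, with a negative lookahead `(?!\d)`, is an illustrative assumption and not something from the original post.

```python
import re

text = 'zip 123456'

# (?=zip) only checks that 'zip' starts here and consumes nothing, so
# (\d{5}|\s)* must then match at the 'z' and can only match the empty string.
ahead = re.match(r'(?=zip)(\d{5}|\s)*', text)
print(ahead.span())  # (0, 0)

# (?<=zip) lets the match start right after 'zip', but nothing forbids
# stopping after five digits, so the trailing '6' is simply left unmatched.
behind = re.search(r'(?<=zip)(\d{5}|\s)*', text)
print(repr(behind.group(0)))  # ' 12345'

# Illustrative variant: (?!\d) after the digits rejects a sixth digit.
strict = re.search(r'(?<=zip)\s+\d{5}(?!\d)', text)
print(strict)  # None
```

Lookarounds are zero-width, so they restrict where a match may start or end but never consume the text they inspect.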
Can someone tell me how to write a regex that gives a match when zip is followed by codes of exactly 5 digits, and None otherwise?
That is what I have tried. I need a regex that accepts any alphanumeric characters containing zip, followed by a 5-digit numeric code.
CodePudding user response:
I suggest using a word boundary (\b) as follows:
import re
t1 = 'zip 1234' # less than 5, should not match
t2 = 'zip 12345' # should match
t3 = 'zip 123456' # more than 5, should not match
pattern = r'zip\s\d{5}\b'
print(re.search(pattern, t1)) # None
print(re.search(pattern, t2)) # <re.Match object; span=(0, 9), match='zip 12345'>
print(re.search(pattern, t3)) # None
\b is a zero-length assertion, useful for making sure you have matched a complete word rather than just part of one. See the re docs for details of how \b operates.
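To make the effect of the trailing \b concrete, here is a small sketch comparing the pattern with and without it on the 6-digit input. The version without \b is only there for contrast and is not suggested by the answer.

```python
import re

text = 'zip 123456'

# Without \b the regex is satisfied by the first five digits of a longer number.
without_boundary = re.search(r'zip\s\d{5}', text)
print(without_boundary.group(0))  # zip 12345

# With \b there must be a word/non-word transition after the fifth digit;
# '5' followed by '6' is not one, so the search fails.
with_boundary = re.search(r'zip\s\d{5}\b', text)
print(with_boundary)  # None
```

Note that `\s` matches exactly one whitespace character, so as written the pattern expects the digits to follow zip directly, after a single space.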
CodePudding user response:
You can use
re.search(r'\bzip\b.*?\d{5}(?:\s+\d{5})*\b', text)
If you also want to capture the ZIPs, you can use a capturing group:
re.search(r'\bzip\b.*?(\d{5}(?:\s+\d{5})*)\b', text)
Details:
\b - a word boundary
zip - a zip string
\b - a word boundary
.*? - zero or more chars other than line break chars, as few as possible
\d{5} - five digits
(?:\s+\d{5})* - zero or more sequences of one or more whitespaces and then five digits
\b - a word boundary
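As a usage sketch, the capturing pattern can be wrapped in a small helper that returns the codes as a list. The names SEARCH, STRICT and extract_zips are hypothetical. The STRICT pattern is an assumption added here, not part of the answer: it uses `\D*` so that no digits can sit between zip and the codes, and it is applied with re.fullmatch so the codes must run to the end of the string.

```python
import re

# Pattern from the answer above, capturing the run of ZIP codes.
SEARCH = re.compile(r'\bzip\b.*?(\d{5}(?:\s+\d{5})*)\b')

# Stricter variant (an assumption, not from the answer): \D* keeps digits out
# of the text between 'zip' and the codes; it is meant to be used with fullmatch.
STRICT = re.compile(r'.*?\bzip\b\D*(\d{5}(?:\s+\d{5})*)\s*')


def extract_zips(text, strict=False):
    """Return the list of 5-digit codes after 'zip', or None if there is no match."""
    m = STRICT.fullmatch(text) if strict else SEARCH.search(text)
    return m.group(1).split() if m else None


sample = 'zip you have some thing 70456 90876 78905'
print(extract_zips(sample))                     # ['70456', '90876', '78905']
print(extract_zips(sample, strict=True))        # ['70456', '90876', '78905']
print(extract_zips('zip 123456'))               # ['23456']
print(extract_zips('zip 123456', strict=True))  # None
```

The third call shows a subtlety of the search-based pattern: the lazy `.*?` is allowed to consume the leading '1', so the last five digits of a 6-digit number still match. If such input must be rejected, something along the lines of the strict variant may be a better fit.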